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(columns),  computed  using  (35).  The  eqi  pel  values  were  truncated  to  the  maximum  values 
50  and  5  respectively.  The  mean  and  standard  deviation  of  the  errors  in  pixels  (36),  over 
all  cameras  are  shown  below  each  plot.  Notice  that  each  plot  uses  a  different  scale  for 
better  visualization  of  errors.  Color  bars  shown  to  the  right . 66 

43  Evaluation  of  camera  parameters  using  EEE  measure  after  using  MapTK  for  the  ABQ215 
aerial  WAMI  dataset.  Left:  EEE  (error)  plot.  The  pel  in  the  matrix,  eqi,  indicates  the  error 
between  the  q- th  camera  (matrix  rows)  and  Z-th  camera  (columns),  computed  using  (35). 

The  eqi  pel  values  were  truncated  to  the  maximum  value  of  3.  The  mean  and  standard 


deviation  of  the  errors  in  pixels  (36),  over  all  cameras  are  fie  =  2.31  and  cre  =  2.  Right: 
Histogram  of  the  corresponding  error  values . 67 


44  Visual  assessment  of  camera  parameters  using  epipolar  lines  using  uncorrected  metadata, 

refined  metadata  by  BA4S  and  VisualSfM.  Some  ground-truth  points  between  two  pairs  of 
cameras,  (#50, #150)  and  (#1,#158),  in  the  ABQ215  WAMI  dataset  and  a  pair  of  cameras 
(#50,#  150)  in  Berkeley  WAMI  dataset  were  used.  The  shown  images  correspond  to  the 
left  camera  in  each  pair.  On  each  shown  image  the  corresponding  ground-truth  point  is 
indicated  by  red  circle.  Epipolar  lines  corresponding  to  the  ground-truth  point  from  the 
right  camera  in  each  pair  are  directly  calculated  using  camera  parameters  (using  Eq.  (28)) 
and  plotted  on  the  image  of  the  left  camera  in  each  pair  (shown  images).  The  camera  pa¬ 
rameters  from  three  different  sources  were  used  in  each  column;  left:  uncorrected  original 
metadata,  middle:  BA4S  (refined  metadata)  and  right:  VisualSFM . 67 

45  Visual  assessment  of  camera  parameters  using  epipolar  lines  using  the  metadata  before 
and  after  refinement  by  BA4S.  The  left  and  right  (cropped)  image  in  each  row  correspond 
to  the  camera  parameters  assessments  using  original  metadata  (uncorrected)  and  BA4S 
optimized  ones,  respectively.  While  Fig.  44  depicted  the  epipolar  lines  for  the  point  cor¬ 
respondences  between  just  two  views,  each  row  in  the  current  figure  represents  a  similar 
assessment  however  between  one  camera  and  all  other  cameras  within  the  dataset.  The 
groundtruth  point  is  marked  in  yellow.  Instead  of  drawing  all  epipolar  lines  over  an  im¬ 
age,  an  analogue  point  for  each  epipolar  line  is  plotted  (colored  in  red).  We  obtain  the  line 
segment  which  joins  the  ground  truth  point  to  the  epipolar  line  and  is  perpendicular  to  the 
line.  Each  of  such  a  line  segment  is  used  as  an  error  vector.  The  foot  of  the  perpendicular 
corresponding  to  each  epipolar  is  plotted  as  red  points.  The  mean  and  covariance  matrix  of 
the  error  vectors  are  computed  and  the  values  (in  pixels)  are  printed  over  the  top  of  each 
image  in  the  figure.  The  covariance  matrix  and  mean  are  also  plotted  as  an  ellipse  and 
point  (in  blue),  respectively.  In  these  two  samples,  we  observed  reductions  in  the  errors  by 
factors  of  more  than  70  and  140  times  respectively  for  Albuquerque  and  Berkeley  datasets.  68 
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46  Visual  assessment  of  camera  parameters  using  epipolar  lines  using  the  corrected  metadata 
by  BA4S  and  VSFM.  Visual  assessment  of  camera  parameters  using  epipolar  ellipses  in 
ABQ215  dataset  (frame  #215).  Left:  VSFM,  right:  BA4S.  While  Fig.  44  depicted  the 
epipolar  lines  for  the  point  correspondences  between  just  two  views,  the  current  figure 
represents  a  similar  assessment  however  between  one  camera  and  all  other  cameras  within 
the  dataset.  The  ground  truth  point  is  marked  in  yellow.  Instead  of  drawing  all  epipolar 
lines  over  an  image,  an  analogue  point  for  each  epipolar  line  is  plotted  (colored  in  red). 

We  obtain  the  line  segment  which  joins  the  ground  truth  point  to  the  epipolar  line  and  is 
perpendicular  to  the  line.  Each  of  such  a  line  segment  is  used  as  an  error  vector.  The  foot 
of  the  perpendicular  corresponding  to  each  epipolar  is  plotted  as  red  points.  The  mean  and 
covariance  matrix  of  the  error  vectors  are  computed  and  the  values  (in  pixels)  are  printed 
over  the  top  of  each  image  in  the  figure.  The  covariance  matrix  and  mean  are  also  plotted 
as  an  ellipse  and  point  (in  blue),  respectively.  The  Euclidean  error  corresponding  to  this 
sample  is  2.02  and  0.82  pixels  for  VSFM  and  BA4S  respectively.  Notice  that  the  red  error 
points  are  all  scattered  very  close  to  the  ground  truth  (yellow  point)  and  therefore  may  not 

be  visible,  specially  for  the  BA4S  result  as  the  error  is  very  small . 69 

47  Visual  assessment  of  camera  parameters  using  epipolar  lines  using  uncorrected  metadata, 
refined  metadata  by  BA4S  and  MapTK.  A  ground-truth  point  between  a  pair  of  cameras 
(#1,#158)  in  the  ABQ215  WAMI  dataset  is  used.  The  shown  images  correspond  to  the 
left  camera  in  each  pair.  On  each  shown  image  the  corresponding  ground-truth  point  is 
indicated  by  red  circle.  Epipolar  lines  corresponding  to  the  ground-truth  point  from  the 
right  camera  in  each  pair  are  directly  calculated  using  camera  parameters  (using  Eq.  (28)) 
and  plotted  on  the  image  of  the  left  camera  in  each  pair  (shown  images).  The  camera 
parameters  from  three  different  sources  were  used  in  each  column;  left:  uncorrected  orig¬ 
inal  metadata,  middle:  BA4S  (refined  metadata)  and  right:  MapTK  [122].  As  one  can 
see  there  are  some  pixel  errors  between  the  ground  truth  point  and  corresponding  epipolar 


line  computed  from  MapTK  result,  which  is  consistent  with  the  errors  in  plot  43-left  in 
the  region  for  the  corresponding  pair  (between  the  pair  #1,#158) . 70 

48  Dense  3D  point  clouds  for  four  aerial  WAMI  sequences  using  CMVS  [81]  with  BA4S 

corrected  camera  pose  metadata.  Source  imagery  from  Transparent  Sky . 71 

49  Position  errors  before  (red  curve)  and  after  (green  curve)  optimization  using  BA4S  to 

refine  the  metadata  for  DinoRing  (top)  and  Fountain-Pi  1  (bottom)  datasets . 72 

50  Visual  assessment  of  camera  parameters  using  epipolar  lines  in  DinoRing  (first  three  rows) 
and  Fountain-Pll  (last  three  rows)  datasets.  Each  row  shows  the  results  for  one  corre¬ 


spondence.  First  and  second  columns  are  using  the  original  (perturbed)  camera  metadata 


parameters.  Notice  that  in  (b),  (f),  (n),  (r)  and  (v),  the  epipolar  line  went  off  the  image 
plane  due  to  large  metadata  error.  Third  and  fourth  columns  of  each  row  show  the  full  and 

zoomed  views  using  camera  parameters  after  BA4S . 73 

5 1  Dense  reconstruction  using  PMVS  after  optimized  camera  parameters  are  estimated  using 

BA4S.  Left:  DinoRing,  right:  FountainPll . 74 


52  Registration  of  frames  #0  and  #60  in  ABQ  WAMI  dataset  for  two  cases  of  uncorrected 
metadata  and  corrected  by  the  proposed  BA4S.  For  illustration,  eight  points  within  these 
cropped  registered  images  were  manually  tracked.  The  green  markers  indicate  the  correct 
position  of  the  points  (ground  truth)  and  the  red  ones  indicate  the  corresponding  points 
from  the  other  frame  after  overlay.  The  average  RMS  error  for  the  registered  frames  using 
the  original  metadata  (uncorrected)  and  using  corrected  metadata  (by  BA4S)  are  about  70 
and  1.1  pixels,  respectively . 75 
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53  Overlays  of  two  registered  frames  (#0  and  #60)  in  ABQ  WAMI  dataset  for  two  cases  of 

uncorrected  metadata  and  corrected  by  the  proposed  BA4S.  The  corresponding  registered 
frames  were  already  shown  in  Fig.  52.  As  depicted  on  the  right  plots,  the  points  lying 
on  the  ground  plane  from  the  two  different  frames  are  precisely  coincided  after  BA4S 
registration.  Pixels  off  the  ground  plane  like  the  buildings  are  wobbled  due  to  the  parallax 
effects  explained  in  Fig.  7 . 76 

54  Illustration  of  motion  detection  using  Flux  trace  only:  From  left  to  right,  cropped  ROI 
of  Albuquerque  aerial  imagery  (/riooX  the  spatio-temporal  motion  information  computed 
by  flux  trace  for  the  selected  image,  flux  mask  in  which  each  pixel  is  classified  as  moving 

or  stationary  by  thresholding  the  flux  trace.  Morphology  is  applied  to  improve  the  result.  .  77 

55  Illustration  of  motion  detection  using  Flux  trace  and  Depth . 77 

56  Illustration  of  motion  detection  using  Flux  trace+Depth+GAC  . 78 

57  Object-level  Precision  and  Recall:  Performance  evaluation  of  our  proposed  fused  motion  detection 

method . 78 

58  Object-Level  Fmeasure:  Performance  evaluation  of  our  proposed  fused  motion  detection  method  .  78 

59  Color  versus  intensity,  (a)  Tracked  objects  and  their  surroundings  have  different  color  features;  (b) 
Tracked  objects  and  their  surroundings  have  similar  color  but  different  intensity  features.  Columns 
left  to  right:  original  image,  likelihood  map  obtained  using  color  names  (CN)  feature,  likelihood 

map  obtained  using  intensity  feature . 79 

60  Frame  level  max-pooling  best  scale  selection  versus  pixel-based  max-pooling.  The  third 

column:  the  frame  level  C *  over  all  scales,  while  the  sixth  column:  jCmax  pixel-based 
max-pooling  result . 79 

61  Performance  evaluation  on  VOT2015  dataset:  (a)  robustness,  (b)  accuracy.  Sequences  reordered 

by  robustness  and  accuracy  values  respectively . 80 

62  Robustness  comparison  between  CN,  extended  CN,  LoFT,  and  CS-LoFT  trackers . 80 

63  Sample  KC-LoFT  tracking  results  on  ABQ  aerial  surveillance  dataset . 82 
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65  Comparison  of  tracking  results  in  terms  of  robustness.  Trajectory  color  represents  num¬ 

ber  of  reinitializations  after  tracker  failures.  Lower  number  of  trajectory  colors  indicates 
higher  tracker  robustness . 84 

66  Contribution  of  each  tracker  component.  Left:  PR-MOTA  metric  for  each  experiment 

compared  to  the  Full  Mode  (higher  is  better).  Right:  description  of  each  experiment.  ...  85 
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1  Summary 


The  goal  on  this  effort,  as  a  part  of  the  AFRL  ECP,  was  to  take  the  computation  closer  to  the  sensor.  Our 
focus  was  on  feature  detection  and  matching  techniques  along  the  video  sequence,  developing  efficient 
structure-from-motion  and  bundle  adjustment  techniques  for  high  resolution  aerial  imagery,  designing  3D 
reconstruction  algorithm  to  run  on  metropolitan- scale  data,  analytical  image  stabilization  and  georegistra¬ 
tion,  and  developing  moving  object  detection  and  tracking  algorithms. 

Processing  close  to  the  sensor  is  becoming  essential  given  the  large  image  sensor  data  volumes  for 
maximum  endurance,  coordinated  persistent  surveillance  within  contested  environments  and  intelligent 
decision  making  using  autonomous  platforms.  The  creation  of  3D  models  from  airborne  wide  area  motion 
imagery  (WAMI)  will  facilitate  a  number  of  capabilities  including  improved  geo-registration,  increased 
reliability  of  target  tracking  through  occlusions  and  shadows,  onboard  processing  for  video  object  analysis, 
better  modeling  of  object  motion  dynamics,  and  enable  high  compression  of  large  scale  scene  imagery 
for  efficient  real  time  downlinks.  This  ECP  accomplished  the  previous  work  on  creating  a  robust  3D 
capability  using  new  scalable  multiview  reconstruction  algorithms.  Scalability  to  large  regions  and  higher 
resolution  point  clouds,  improved  accuracy  of  structures  and  parallelization  for  a  factor  of  ten  reduction  in 
computational  speed  were  the  focus  of  this  ECP  effort,  which  has  been  successfully  achieved.  Automatic 
moving  object  detection  and  segmentation  is  one  of  the  fundamental  low-level  tasks  for  many  of  the 
WAMI  surveillance  systems  including  traffic  monitoring,  activity  and  behavior  recognition  and  tracking 
applications.  We  developed  an  automatic  moving  vehicle  detection  system  for  wide  area  aerial  imagery 
based  on  semantic  fusion  of  flux  trace  and  building  mask  information.  An  average  precision  of  90%  and 
recall  of  80%  have  been  reported  for  200,  2k  x  2k  cropped  region  of  interest  from  Albuquerque  urban 
imagery. 

This  effort  fulfilled  the  proposed  three  directions  of  research  defined  in  the  ECP  extension  within 
scope  of  the  original  proposal:  a)  Improved  accuracy  of  multiview  3D  reconstruction  of  urban  scenes 
using  a  volumetric  multipass  voting  algorithm  that  uses  bundle  adjustment  to  ensure  accurate  camera  pose 
and  position  information,  uniformly  dense  feature  extraction  for  sufficient  vote  accumulation,  appropriate 
temporal  sampling,  point  cloud  extraction  from  the  vote  volume,  vertex  coloring  and  post  processing 
to  estimate  accurate  surface  geometries,  b)  Stream  processing  to  enable  integration  of  video  analysis 
algorithms  developed  for  object  motion  analysis  under  the  CETE  project,  c)  Parallelization  of  components 
of  the  visual  exploitation  algorithms  including  video  motion  analysis,  3D  reconstruction.  The  flow  of 
processing  for  the  3D  reconstruction  algorithm  is  shown  in  the  system  diagram  below  with  the  major 
modules  identified.  Our  pipeline  in  this  effort  was  a  novel  approach  to  recovering  3D  structure  of  urban 
scenes  with  dense  buildings  using  a  3D  multipass  voxel  voting  method.  Some  intermediate  stages  of 
the  pipeline  including  sensor  measurements,  bundle  adjustment,  voting,  post  processing  and  point  cloud 
coloring  and  rendering  are  shown  below. 

2  Introduction 

Processing  close  to  the  sensor  is  becoming  essential  given  the  large  image  sensor  data  volumes  for  max¬ 
imum  endurance,  coordinated  persistent  surveillance  within  contested  environments  and  intelligent  deci¬ 
sion  making  using  autonomous  platforms.  The  creation  of  3D  models  from  airborne  wide  area  motion 
imagery  (WAMI)  will  facilitate  a  number  of  capabilities  including  improved  geo-registration,  increased 
reliability  of  target  tracking  through  occlusions  and  shadows,  onboard  processing  for  video  object  anal¬ 
ysis,  better  modeling  of  object  motion  dynamics,  and  enable  high  compression  of  large  scale  scene  im¬ 
agery  for  efficient  real  time  downlinks.  This  technical  report  presents  our  endeavors  within  the  scope  of 
this  proposal  on  creating  a  robust  3D  capability  using  new  scalable  multiview  reconstruction  algorithms. 
Parallelized  visual  processing  algorithms  for  heterogeneous  multi-core  architectures  will  provide  a  path 
to  achieve  real  time  requirements  for  the  computationally  intensive  3D  reconstruction  tasks.  Under  the 
current  project  we  have  successfully  demonstrated  3D  reconstruction  of  several  urban  areas  including  Al¬ 
buquerque,  Los  Angeles,  Berkeley,  Columbia,  Coit  Tower/San  Francisco  and  Four  Hills.  We  have  also 
shown  a  great  achievement  in  improving  the  accuracy  of  camera  metadata  using  a  fast  and  robust  structure 
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Figure  1:  The  3D  processing  pipeline  flowchart  showing  key  modules. 


from  motion  and  bundle  adjustment  algorithm.  By  fusion  of  depth  masks  of  the  scene  obtained  from  3D 
reconstruction  and  a  flux  tensor  based  detection  method,  we  have  drastically  improved  our  multi  target 
tracking  algorithm.  Below  is  the  list  of  our  deliverable  on  this  effort. 

1.  Bundle  adjustment  algorithm  to  improve  noisy  camera  metadata  that  exploits  the  availability  of  ini¬ 
tial  metadata  from  on-board  sensors,  WAMI  flightpath  geometry  to  ensure  that  rays  across  multiple 
views  for  the  same  feature  will  converge  appropriately  using  co-visibility  constraints.  (ACCOMPLISHED) 

2.  Multi-pass  voxel-based  voting  algorithm  with  a  tradeoff  between  the  size  of  the  voxel  space  and  the 
cost  of  computation  and  memory  requirements  for  a  given  level  of  3D  reconstruction  fidelity  that  is 
scalable  to  large  scenes.  (Accomplished) 

3.  Parallelizing  critical  parts  of  the  overall  3D  processing  pipeline  using  a  combination  of  methods  in¬ 
cluding  converting  compute  intensive  functions  from  Matlab  to  C++,  using  multithreading,  porting 
to  parallel  GPU  implementations  as  necessary.  (ACCOMPLISHED) 

4.  Reducing  the  current  total  time  taken  from  about  100  hours  for  reconstructing  the  downtown  LA 
region  using  a  Matlab  implementation  to  10  hours  using  a  mixture  of  Matlab,  C++  and  GPU  CUDA. 
(Accomplished) 

5.  Extension  of  implementation  of  multiscale  structure  tensor  point  feature  extraction  and  persistent 
feature  alignment  for  improved  positional  accuracy  and  to  incorporate  linear  features.  (ACCOMPLISHED) 

6.  Vote  refinement  algorithm  to  improve  the  quality  of  the  point  cloud  by  removing  noisy  3D  vertices 
or  outlier  voxels,  fill  in  gaps  and  holes  in  surfaces,  plane  fitting  and  smoothly  interpolate  across 
curved  surfaces.  (Accomplished) 

7.  Point  cloud  coloring  algorithm  using  information  about  local  spatially  consistent  patches  in  the  3D 
scene  obtained  by  surface  clustering  and  normal  estimation.  (ACCOMPLISHED) 

8.  Interactive  visualization  tool  called  Nimbus  or  Stratus  for  viewing  point  clouds  and  their  geometric 
properties.  (ACCOMPLISHED) 

9.  Visual  motion  analysis  algorithms  including  LOFT  and  CSURF  for  transition  to  AFRL.  (ACCOMPLISHED) 

10.  Many  core  parallel  implementation  of  some  of  the  computationally  intensive  modules  such  as  fea¬ 
ture  extraction,  bundle  adjustment,  3D  voting  and  point  cloud  processing.  (ACCOMPLISHED) 


12 

APPROVED  FOR  PUBLIC  RELEASE;  DISTRIBUTION  UNLIMITED 


1 1 .  Collaboration  with  Transparent  Sky  if  subcontract  funds  are  available  for  improved  geo-registration, 
support  for  high  dynamic  range  image  acquisition  and  processing  to  deal  with  low  contrast  and 
shadow  regions,  and  Bayesian  super-resolution  to  improve  target  tracking.  (ACCOMPLISHED) 

12.  Documentation  for  technical  work  completed  and  information  gained  as  part  of  this  ECP  project. 
Provide  a  description  of  important  observations,  nature  of  problems,  positive  and  negative  results, 
and  various  design  criteria  established,  procedures  followed,  or  lessons  learned. 

13.  Attend  and  present  at  technical  review  meetings  for  the  overall  CETE  program. 

14.  Deliver  software  programs  developed,  assembled,  or  acquired  to  accomplish  the  video  based  3D 
reconstruction  algorithm. 


3  Methods,  Assumptions  and  Procedures 

3.1  Bundle  Adjustment  (BA)  for  Camera  Metadata 

The  use  of  Global  Positioning  System  (GPS)  and  Inertial  Measurement  Unit  (IMU)  sensors  to  track  the 
3D  path  of  platforms  and  cameras  is  becoming  more  widely  available  and  is  routinely  used  in  aerial 
navigation  and  imaging  [38]  as  well  as  hand  held  navigation  devices  [114,  4,  134].  The  camera  path  and 
pose  information  is  used  to  support  robust,  real-time  3D  scene  reconstruction  using  Structure-from-Motion 
algorithms  (SfM)  [104,  174,  184,  17,  14,  114,  23,  15,  18,  1<  ]  and  direct  georegistration.  Irschara  et  al.  in 
[104]  observe  that,  "These  systems  rely  on  highly  accurate  geo-referencing  devices  that  are  calibrated  and 
the  delivered  pose  and  orientation  estimates  are  often  superior  to  the  one  obtained  by  image  based  methods 
(i.e.  subpixel  accurate  image  registration)".  However,  many  (inexpensive)  aerial  platforms  produce  IMU 
and  GPS  values  of  limited  accuracy  due  to  measurement  and  timing  errors  which  then  need  to  be  refined 
for  accurate  SfM  [104].  Extracting  and  incorporating  3D  information  in  Wide  Area  Motion  Imagery 
(WAMI)  processing  [160,  39]  will  be  very  useful  for  mitigating  parallax  effects  in  video  summarization 
[208,  209],  better  stabilization  and  appearance  models  for  tracking  [155],  depth  map  filtering  of  motion 
detections  [175,  15  ],  and  improving  video  analytics  like  object  tracking  [169,  16(  ].  In  this  project  a  fast 
and  robust  camera  pose  refinement  and  SfM  method  is  proposed  for  sequential  aerial  imagery  applicable 
for  reliable  georegistration  and  vehicle  tracking  tasks.  The  data  flow  of  the  pipeline  is  depicted  in  Fig.  2. 
The  pipeline  receives  2D  aerial  images  which  were  sequentially  acquired  as  well  as  approximate  (noisy) 
pose  information  from  on-board  sensors  that  provide  location  (using  a  GPS)  and  orientation  (using  an 
IMU).  Point  of  interests  (features)  are  extracted  in  each  image.  The  extracted  features  from  each  two 
adjacent  frames  within  the  sequence  (starting  from  the  first  frame)  are  compared  against  each  other.  Going 
towards  the  next  frames,  sequentially  matched  features  along  the  sequence  are  used  to  build  tracks  of 
interest  points.  2D  feature  points  in  each  track  are  triangulated  (linear  method  [95])  using  the  noisy 
camera  parameters  (from  metadata)  to  estimate  a  corresponding  3D  point  in  the  scene.  A  non-linear 
optimization  method  (Levenberg-Marquardt)  is  used  to  jointly  refine  the  camera  parameters  and  estimated 
3D  points.  In  contrast  to  other  existing  methods,  the  noisy  metadata  are  directly  used  as  the  main  source 
for  initializing  the  camera  poses  without  estimating  them  through  (or  combining  them  with)  image-based 
approaches  such  as  the  five-point  algorithm  [  52,  102,  8].  Outliers  or  mismatch  errors  can  adversely  affect 
the  non-linear  optimization  approaches  and  in  order  to  mitigate  their  impacts  a  robust  function  is  proposed 
which  is  adaptive  with  respect  to  the  persistency  of  each  tracked  feature.  The  proposed  approach  neither 
needs  to  estimate  camera  parameters  nor  to  classify  inliers/outliers  in  early  stages  of  the  pipeline.  As  a 
direct  consequence,  consensus  combinatorial  methods  (RANSAC  or  its  variations-  see  Fig.  3)  are  avoided 
which  so  far  have  been  a  core  part  of  most  SfM  and  georegistration  pipelines  to  combat  measurement 
noise,  outliers  and  mismatch  errors  in  point  correspondences  while  estimating  the  camera  parameters. 
Fig.  3  depicts  the  pipeline  of  a  conventional  SfM,  while  our  proposed  pipeline  is  shown  in  Fig.  4. 
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Georegistered  images  using  BA4S 


i 


Applications 


Dense  3D  point  cloud  Elevation/depth  map 


Moving  vehicle  detection  and  tracking  using  stabilized  images 


Figure  2:  Overall  view  (data  flow)  of  the  proposed  pipeline.  Raw  ultra  high  resolution  images  (6,600x4,400  pixels)  together  with  noisy  metadata, 
acquired  from  an  airborne  platform  flying  over  Albuquerque-NM  downtown,  are  the  inputs  to  the  pipeline.  BA4S  processes  them  in  a  very  short 
amount  of  time  (in  this  case  less  than  12  minutes  for  1071  images).  The  BA4S’s  outputs  are  the  refined  camera  parameters  and  sparse  point 
cloud.  Georegistration  is  performed  using  the  proposed  analytical  homographies.  Some  applications  for  the  BA4S  are  shown  in  the  bottom  box: 
BA4S’s  outputs  are  used  to  produce  a  dense  point  cloud  (using  PMVS[81]).  The  georegistered  images  are  then  used  together  with  available  3D 
information  (depth  of  the  buildings  from  the  mentioned  dense  point  cloud)  to  perform  moving  vehicle  detection  and  tracking [175,  li  ]. 
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Figure  3:  A  conventional  SfM  pipeline.  The  acronyms  RSS,  ME  and  OE  used  in  this  figure  stand  for  Random  Subset  Selection ,  Model  Estimation 
and  Outlier  Elimination,  respectively.  In  the  conventional  methods,  iterative  RANSAC  (or  its  variations)  is  inevitable  which  includes  randomness 
and  time  consumption. 

3.1.1  Related  Work 

In  the  computer  vision  community  the  camera  parameters  (pose)  are  referred  to  as  intrinsic  and  extrin¬ 
sic *,  while  in  photogrammetry,  these  same  metadata  are  known  as  interior  and  exterior  parameters  and 
the  estimation  process  is  referred  to  as  resectioning.  Precise  estimates  of  these  parameters  are  critical 
for  practical  computer  vision  applications  (particularly  dense  3D  reconstruction)  and  accurate  aerial  pho¬ 
togrammetry.  BA,  considered  as  the  gold  standard  for  refinement  of  camera  metadata  [197,  136,  95,  223], 
is  a  classical  and  well  studied  problem  in  computer  vision  and  photogrammetry  dating  back  more  than 
three  decades  [132,  203,  95].  BA  uses  initial  estimates  of  camera  poses  to  improve  the  metadata  by  mini¬ 
mizing  reprojection  errors  [64, 147, 203],  but  is  computationally  expensive  having  a  general  computational 
complexity  of  0((n  +  m)3)  and  memory  requirement  of  0(mn(m  +  n))  for  n  cameras  and  m  scene  3D 
points  [140,  79].  Conventional  BA  shows  satisfactory  convergence  if  sufficiently  accurate  initial  estimates 
are  provided  either  from  image  feature  matching-based  essential  matrix  estimation  [102,  ]  or  combined 
with  on-board  sensor  measurements  [124,  78,  184].  A  comprehensive  introduction  to  BA  in  [203]  covers 
a  wide  spectrum  of  methods  and  issues.  Due  to  the  recent  surge  in  developing  large  scale  3D  reconstruc¬ 
tions  using  unorganized  internet  photos,  smartphones  as  well  as  organized  or  sequential  aerial  imagery, 
there  has  been  renewed  interest  in  making  BA  more  robust,  scalable  and  accurate  [10,  106,  118,  102,  188]. 
Recent  methods  include  Sparse  BA  [133,  1 12,  63],  Incremental  BA  [124]  and  Parallel  BA  [217,  215].  Sev¬ 
eral  optimization  methods  for  BA  are  compared  in  [106]  and  the  conjugate  gradient  approach  is  shown  to 
produce  better  results  in  terms  of  speed  and  convergence.  A  SfM  method,  called  Mavmap ,  is  proposed  in 
[184]  which  leverages  the  temporal  consistency  of  aerial  images  and  availability  of  metadata  to  speed  up 
the  performance.  Recently,  MapTK  is  introduced  in  [122]  as  an  open  source  SfM  specially  designed  for 
aerial  video  which  is  able  to  incorporate  the  metadata  in  the  pipeline.  As  we  will  present  in  the  experiment 
section,  MapTK  is  not  robust  enough  and  runs  slower  than  our  proposed  pipeline  with  a  lower  accuracy, 
as  it  follows  the  conventional  SfM  pipeline.  In  [184],  VisualSFM/Bundler  [217]  was  considered  as  the 
most  advanced  and  widely  used  system  for  automated  and  efficient  3D  reconstruction  from  2D  images. 
However,  as  stated  in  [184],  it  is  not  efficient  for  aerial  imagery  and  also  has  no  integration  of  IMU  priors. 
In  [104],  the  authors  used  a  view  selection  strategy  to  speedup  SfM  but  had  limited  success  using  a  robust 
BA,  "Through  our  experiments,  robust  bundle  adjustment  was  not  able  to  converge  to  a  true  solution  from 
raw  IMU  initialized  projection  matrices".  Although  the  authors  proposed  a  robust  BA  and  tested  it,  their 
approach  was  not  sufficient  to  improve  the  camera  parameters  when  raw  metadata  was  used.  In  our  work, 
we  successfully  used  our  proposed  robust  BA  (BA4S)  to  refine  inaccurate  camera  parameters.  BA4S  pro¬ 
poses  a  different  SfM  pipeline  and  camera  motion  model  that  enables  inaccurate  sensor  measurements 
to  be  directly  used  as  initial  values  for  BA  optimization  and  without  any  early-stage/explicit  feature  mis¬ 
match  filtering  (i.e.  no  RANSAC  or  its  variants).  Many  approaches  to  use  GPS  and  IMU  measurements 
for  refining  camera  parameters  have  been  proposed,  especially  in  the  robotics  community.  However,  GPS 
and  IMU  measurements  have  been  used  mostly  as  ancillary  information  along  with  other  pose  estima¬ 
tion  methods  through  the  essential  matrix  (e.g.  Five-Point  algorithm)  in  computer  vision  [124,  78]  or 
resectioning  in  photogrammetry.  For  example,  in  [78,  124,  174,  41],  platform  and  sensor  GPS  and  IMU 
measurements  are  fused  with  an  SfM  approach  using  an  Extended  Kalman  Filter  or  as  extra  constraints  in 
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Figure  4:  Developed  SfM  and  georegistration  pipeline.  Noisy  metadata  are  directly  used  for  triangulation  and  robust  non-linear  LS  optimization, 
while  RANSAC  (or  its  variations)  is  avoided.  The  main  pipeline  produces  three  types  of  outputs  including:  refined  camera  parameters,  sparse 
3D  point  cloud  and  georegistered/stabilized  images.  Homography  transformations  in  the  georegistration  are  directly  (analytically)  computed 
from  refined  camera  parameters  (no  frame-to-frame  or  piecewise  homography  estimation).  Three  of  the  primary  applications  of  this  pipeline  are 
shown.  One  can  obtain  a  dense  3D  point  cloud  using  a  dense  3D  reconstruction  algorithm.  Moving  vehicle  detection  can  be  applied  on  the  well 
stabilized  images  for  the  purpose  of  tracking  (see  [175,  159]  for  details).  Video  analytics  [208,  209]  is  another  demanding  application  which  can 
benefit  from  the  proposed  georegistration  pipeline. 

BA  in  order  to  produce  3D  reconstructions  and  not  directly  as  in  our  proposed  robust  BA4S  approach. 

3.1.2  Structure-from-Motion  and  Bundle  Adjustment  using  BA4S 

This  section  introduces  details  of  BA4S,  our  developed  SfM  and  BA  algorithm.  A  flow  diagram  of  the 
pipeline  is  presented  in  Fig.  4.  We  start  with  describing  our  feature  extraction  and  matching  strategy  (the 
first  three  top-left  blocks)  and  continue  with  introducing  our  developed  approach  for  camera  pose  refine¬ 
ment  to  optimize  the  intrinsic  (interior)  and  extrinsic  (exterior)  camera  parameters  using  a  persistency- 
based  robust  error  function. 

3.1.2.1  Features,  Sequential  Matching  and  Tracking 

In  sequential  image  capture,  we  know  which  frames  are  adjacent  to  each  other,  as  in  persistent  aerial 
WAMI  [160]  or  hyper-lapse  first  person  videos  [114].  By  leveraging  this  powerful  temporal  consistency 
constraint  between  images  as  prior  information,  we  reduce  the  time  complexity  of  matching,  n  cameras, 
from  0(n 2)  to  O(n),  without  compromising  the  quality  of  BA  results  [1&  ].  In  our  proposed  approach, 
interest  points  are  extracted  from  each  image  using  a  robust  local  feature  detector.  Starting  from  the  first 
frame,  for  each  two  successive  image  frames,  the  descriptors  associated  with  interest  points  are  compared 
with  successive  matches  building  up  a  set  of  feature  tracks  without  using  RANSAC.  A  feature  track 
provides  evidence  that  a  potentially  unique  3D  point  in  the  scene  has  been  observed  in  a  consecutive  set 
of  image  frames.  Fig.  5  illustrates  a  scene  3D  point,  X7,  being  observed  by  7 j  cameras,  Cj~ . . .  C(fc+7j_!), 
in  a  row.  Its  corresponding  track,  Tj,  can  be  defined  as 

Tj  =  {xjfc,xi(fc+1),...xi(fc+7._1)  I  k  €  1  ...n,j  €  1 . . .  m,  2  <  7 j  <  n}  (1) 

where  denotes  the  2D  image  point  of  X7  in  the  image  plane  of  C&,  7 j  is  the  number  of  cameras 

sequentially  observing  the  3D  point  X7,  and  m  is  number  of  3D  points/tracks.  A  whole  set  of  tracks  in  a 
dataset  is  presented  by  T  =  {ti,  72, . . .  rm}.  In  our  SfM  approach,  we  consider  7 j  as  a  persistency  factor 
related  to  track  tj.  Indeed,  7 can  be  thought  as  a  factor  that  measures  the  temporal  co-visibility  of  the 
j-th  3D  point  in  the  image  sequence.  Temporal  co- visibility  was  used  in  the  literature  for  other  purposes 
such  as  object  recognition  [88].  Here  we  exploit  it  as  a  robustness  parameter  reflecting  the  reliability  in 
identifying  a  3D  scene  point.  Each  track  (i.e.  estimated  3D  feature  point  trajectory)  has  an  associated 
persistency  factor.  After  building  all  tracks,  T,  the  population  statistics  of  track  persistency  factor  for  m 
3D  points  are  estimated  including  the  mean,  fiF  =  ^  J2JLi  7 j  and  standard  deviation,  aF.  These  first 
and  second  order  track  persistency  statistics  are  used  to  appropriately  weight  each  track  in  a  novel  manner 
in  the  BA  optimization  formulation. 
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Figure  5:  Illustration  of  sequential  tracking.  A  scene  3D  point,  Xj,  is  observed  by  7 j  cameras,  Ck  • . .  C(k+^  ._i),  in  a  row.  They  constitute  a 
track  set,  Tj ,  with  the  members  defined  in  (1). 


Figure  6:  Bundle  of  rays  between  each  camera  center  and  the  set  of  3D  points.  This  is  an  exemplary  case  in  which  three  3D  points  are  observed 
by  three  cameras. 
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3.1.2.2  Robust  Non-linear  LS  Optimization 

Bundle  Adjustment  (BA)  refers  to  the  problem  of  jointly  refining  camera  parameters  and  3D  structure  in 
an  optimal  manner  often  using  reprojection  error  as  the  quality  metric.  Given  a  set  of  n  cameras  (see  Fig. 
6),  with  arbitrary  poses  (translations,  orientations)  and,  m  points,  BA  optimization  is  defined  as  a  least 
squares  minimization  using  the  L2-norm  or  sum-of-squared  reprojection  errors: 

n  m 

E  =  „  mm  “  $(Xj,  R*,  t;,  Ki) ||2  (2) 

1=1  J= 1 

where  R*,  L,  are  respectively  the  rotation  matrix,  translation  vector  and  (intrinsic)  calibration  matrix 
of  the  zth  camera,  Xj  is  the  j-th  3D  point  in  the  scene  and  observation  Xj^  is  the  2D  image  coordinates 
of  feature  Xj  in  camera  i  and  the  L2  norm  is  used.  The  mapping  g(Xj,  R^,  G,  K^)  is  a  transformation 
model  which  projects  a  3D  point  Xj  onto  the  image  plane  of  camera  i  using  its  extrinsic,  R^  and  G,  and 
intrinsic  parameters,  K i,  defined  as: 


where  is  the  projection  matrix  of  camera  i ,  defined  as: 

P i  ~  ^  [Rz|ti]  • 


(3) 

(4) 


The  intrinsic  parameters  of  the  camera(s)  are  defined  as: 


K  *  = 


fi  0  m 

0  fi  Vi 

0  0  1 


(5) 


fi  being  the  focal  length  in  pixels  and  the  (i^5  Vi)T  is  the  camera  principal  point.  One  can  consider  radial 
distortion  parameters  in  the  camera  calibration  matrix  for  both  computing  the  reprojection  error  and  also 
parameters  to  be  optimized  in  BA  (see  [222,  203,  216,  68]  for  more  details).  The  extrinsic  matrix  is 
composed  of  the  rotation  R*  and  translation  tq  of  the  camera: 


rn,i 

ri2,i 

ri3,i 

tx,i  \ 

T21,i 

r22,i 

r23,i 

ty,i  1  • 

(6) 

r3i,i 

rz2,i 

r33,i 

tz,i  ) 

However,  due  to  errors  in  the  metadata  and  feature  matching  outliers,  g(Xj,  R^,  G,  IQ)  ^  xJZ,  and  the 
minimization  problem  seeks  a  statistically  optimal  estimate  for  the  camera  projection  matrices  P^  and  3D 
feature  points  Xj.  The  L2  minimization  of  the  reprojection  error  involves  adjusting  the  bundle  of  rays  be¬ 
tween  each  camera  center  and  the  set  of  3D  points  which  is  a  non-linear  constrained  optimization  problem. 
Note  that  the  above  minimization  is  equivalent  to  finding  a  maximum  likelihood  solution  assuming  that 
the  measurement  noise  is  Gaussian;  see  [203,  95]  for  more  details.  There  exist  various  methods  to  solve 
the  non-linear  least  squares  problem  including  implicit  trust  region  and  Levenberg-Marquardt  methods 
that  are  well  established  in  the  BA  literature  [133,  106]. 

The  selection  of  2D  feature  point  correspondences  is  one  of  the  most  critical  steps  in  image-based 
multi- view  reconstruction  [13].  Feature  correspondences  are  usually  contaminated  by  outliers,  that  is 
matching  errors  or  incorrect  data  associations.  The  outliers  or  mismatches  may  be  caused  by  occlusions, 
repetitive  patterns,  illumination  changes,  shadows,  image  noise  and  blur  for  which  the  assumptions  of  the 
feature  detector  and  descriptors  are  not  satisfied  [79].  On  the  other  hand,  BA  which  is  usually  solved  using 
the  Levenberg-Marquardt  numerical  method  [203]  is  highly  sensitive  to  the  presence  of  feature  correspon¬ 
dence  outliers  [13].  Mismatches  can  cause  problems  for  the  standard  least  squares  approach;  as  stressed  in 
[  17]  even  a  single  mismatch  can  globally  affect  the  result.  This  leads  to  sub-optimal  parameter  estimation, 
and  in  the  worst  case  a  feasible  solution  is  not  found  [13,  151].  This  is  even  more  problematic  in  high 
resolution  images  that  have  a  large  number  of  features  and  potential  correspondences  which  increases  the 
probability  of  association  or  matching  errors.  Furthermore,  aerial  images  have  a  high  degree  of  parallax 
making  matching  and  feature  tracking  a  much  more  difficult  problem. 
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Generally  the  mismatches  are  explicitly  excluded  from  the  set  of  potential  feature  correspondence  in 
the  early  stages  of  the  conventional  SfM  pipeline  (Fig.  3)  well  before  the  BA  optimization  stage.  In 
this  approach  the  initial  camera  parameters  are  simultaneously  estimated  while  explicitly  detecting  and 
eliminating  outliers  usually  by  applying  different  variations  of  RANSAC.  In  our  proposed  SfM  approach 
(Fig.  4),  we  show  that  we  can  bypass  the  explicit  RANSAC-based  outlier  elimination  step  by  using  an 
appropriate  robust  error  measure. 

Robust  error  functions  also  known  as  M-estimators  are  popular  in  robust  statistics  and  reduce  the  influ¬ 
ence  of  outliers  in  estimation  problems.  We  have  observed  that  not  every  choice  of  a  robust  function  works 
well  [  ’1]  and  a  proper  robust  function  is  critical  for  achieving  a  robust  minimization  of  the  reprojection 
error  when  the  initial  parameters  are  too  noisy  and  outliers  are  not  explicitly  eliminated  beforehand.  Two 
commonly  used  robust  statistics  functions  are  the  Cauchy  (or  Lorentzian)  and  Huber  [203]  measures: 

•  Cauchy  or  Lorentzian  cost  function 

p(s)  =  b2  log(l  +  s2/b2)  (7) 


•  Huber  cost  function 


p(s) 


s2  if  |s|  <  b 

2  6|s|  —  b2  otherwise 


(8) 


where  s  is  the  residual  (i.e.  reprojection  error)  in  (2)  and  b  is  usually  one  or  a  fixed  user  defined  value. 
However,  such  standard  robust  error  functions  are  insufficient  when  directly  used  in  the  BA  optimization 
to  combat  high  percentage  of  outlier  feature  correspondence  errors  as  described  in  [21]. 

We  developed  a  novel  robust  function  which  uses  the  Cauchy  loss  or  robust  error  function  combined 
with  an  adaptive  persistency  factor  weight  for  each  feature  track.  The  persistency  factor  provides  a  weight 
based  on  the  number  of  consecutively  matched  features  compared  to  the  set  of  all  tracks,  to  reduce  the 
effects  of  outliers  in  the  optimization  error  metric: 


Pji  (Sji  i  7j  i  PF  i  )  =  ( 


7 3 


Pf  +  cr  f 


)2  log(l+( 


/iF+CTFx  2  2 


7 3 


Y  4) 


(9) 


where 

b  =  — ^ -  (10) 

/if  +  ctf 

is  the  Cauchy  scale  factor,  s3l  —  ||xJZ  —  g(Xj,  R^,  t*,  K^)  || 2  denotes  the  residual  of  the  j-th  3D  point  in 
the  z-th  camera  (i.e.  feature  track),  7 j  is  the  persistency  factor  related  to  j- th  3D  point,  and  pp  and  °f  are 
the  mean  and  standard  deviation  of  the  persistency  factor,  respectively,  for  the  population  of  feature  tracks. 
Substituting  (9)  into  (2)  we  obtain  a  new  robust  error  function  which  leads  to  the  global  minimization  over 
all  feature  tracks  and  views: 
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The  proposed  robust  function  is  inspired  by  the  Cauchy  or  Lorentzian  robust  function  [104,  203]  which 
has  an  influence  function  very  similar  to  the  Geman-McClure  robust  function  [149]  that  decreases  rapidly 
reducing  the  effect  of  large  outlier  values.  The  residuals,  sp  in  (9)  associated  with  each  feature  track 
are  weighted  adaptively ,  with  longer  lived  feature  tracks  being  favored  (larger  7 j)  over  residuals  with 
shorter  length  feature  tracks  (smaller  7 j).  So  a  larger  persistency  weight  favors  longer  co-visibility  features 
that  are  more  likely  to  represent  the  same  real  3D  structure  point  in  the  scene.  The  proposed  adaptive 
persistency  factor  weights  using  the  modified  Cauchy  robust  function  in  (9)  performed  the  best  compared 
to  the  standard  Cauchy  or  Huber  robust  functions  without  persistency  [  ]. 
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3.2  Georegistration  using  Global  Homography 

Vehicle  detection  and  tracking  is  one  of  the  primary  demands  in  WAMI  aerial  imaging.  To  improve  de¬ 
tection  and  tracking  in  aerial  imagery  in  which  videos  are  captured  on  a  moving  platform,  the  images  are 
registered  to  maintain  the  relative  movement  between  the  moving  platform  and  the  scene  fixed.  Tradition¬ 
ally,  aerial  image  registration  methods  are  performed  through  applying  2D  homography  transformations 
in  the  image  space  [130,  99,  120].  Aerial  image  registration  is  challenging  for  urban  scenes  where  there 
are  large  3D  structures  (tall  buildings)  causing  high  amount  of  occlusion  and  parallax.  In  such  situations, 
the  presence  of  parallax  can  lead  to  significant  error  when  image  space  2D  registration  approaches  are 
used  [58].  A  method  to  register  aerial  images  is  developed  in  this  project  that  utilizes  3D  information,  par¬ 
ticularly  camera  poses  available  from  the  proposed  fast  and  robust  camera  poses  optimization  technique, 
to  speed  up  the  process  and  eliminate  the  effects  of  strong  parallax  and  occlusions.  Unlike  the  traditional 
registration  methods,  the  proposed  technique  uses  special  homography  transformations  which  are  directly 
computed  (derived  as  closed-form  analytic  expression)  from  available  precise  3D  camera  poses  and  are 
not  estimated  using  image-to-image  estimation  techniques. 

3.2.1  Related  Work 

The  majority  of  approaches  for  mosaicing  and  image  stabilization  use  image-based  matching  and  warping 
either  global  or  piece-wise  local  transformation  for  stabilizing  the  ground  plane  prior  to  moving  object 
detection  [130,  99,  120,  141,  183,  137,  129,  54,  201,  207,  198,  90,  154].  As  mentioned  earlier  and  stressed 
in  [58,  224],  aerial  image  registration  is  challenging  for  urban  scenes  where  there  are  large  3D  structure 
and  tall  buildings  causing  high  amount  of  occlusion  and  parallax.  An  aerial  image  registration  method 
was  proposed  in  [130,  10  ]  which  uses  a  multi-layer  (coarse  to  fine)  homography  estimation  approach 
to  deal  with  parallax  and  occlusions.  Although  using  a  hierarchical  homography  estimation  helped  to 
reduce  false  registration,  their  approach  still  suffer  from  the  presence  of  strong  parallax,  as  it  was  not  able 
to  seamlessly  register  all  images  within  a  dataset  altogether.  By  observing  Table-I  in  their  paper([130]) 
one  can  see  that  each  dataset  was  broken  into  several  segments  in  the  registration  process,  due  to  an 
inability  to  faithfully  handle  strong  parallax.  Molina  and  Zhu  [14  ]  proposed  a  method  to  register  nadir 
aerial  images  in  which  a  pyramid  block-based  correlation  method  was  used  to  estimate  inter- frame  affine 
parameters.  They  stated  not  being  able  to  use  available  GPS/INS  measurements  and  just  relying  on  the 
imagery  itself:  "the  measures  made  by  GPS/INS  devices  come  with  errors  due  to  hardware,  and  [if] 
used  directly  will  produce  panoramas  with  large  apparent  errors  and  discontinuities" .  Their  approach 
requires  persistent  (multiple)  cyclic  video  data  collections  to  work.  Moreover  their  approach  has  been  only 
tested  on  nadir  imagery  with  negligible  parallax  issues,  while  in  oblique  imagery  (WAMI)  the  parallax  is 
significantly  stronger.  Direct  georeferencing  of  high-resolution  unmanned  aerial  vehicles  (UAV)  imagery 
was  discussed  in  [205]  while  performances  of  different  SfM  softwares  (Photoscan  [3],  Pix4D  [7]  and 
Bundler  [6,  196])  were  evaluated.  A  dataset  comprising  143  images  of  5,184  x  3,456  pixels  was  used. 
The  authors  concluded  that  Photoscan  has  the  best  performance  as  it  is  fast,  easy  to  use  and  accurate. 
Photoscan,  Pix4D  and  Bundler  each  took  4.3  hours,  11  hours  and  41  hours,  respectively,  to  run  on  this 
dataset.  The  best  achieved  result  (4.3  hours)  in  this  experiments  which  is  from  Photoscan  is  yet  very  far 
away  from  our  result  on  a  similar  dataset  ("Columbia",  202  aerial  images  of  6,600  x  4,400  pixels-  see 
our  Table  2)  which  took  less  than  two  minutes  to  run  the  full  pipeline.  Pritt  from  Lockheed  Martin  [180] 
proposed  a  fast  orthorectification  method  for  registration  of  thousands  of  aerial  images  (acquired  from 
small  UAVs).  The  author  argued  that  BA  is  not  able  to  handle  hundreds  of  aerial  images  and  therefore  it 
is  not  scalable.  According  to  that,  a  technique  was  introduced  as  a  replacement  to  BA  for  the  registration 
process.  The  best  performance  was  reported  to  be  250  images  per  hour ,  or  about  14  sec  per  image ,  for  a 
dataset  including  4,500  images  with  sizes  of  4,000  x  6,000  pixels.  This  experiment  is  comparable  with  our 
"Columbia-II*,  MO"  dataset  (see  Table  2  in  this  paper)  which  includes  5,322  images  with  size  of  6,600  x 
4,400  pixels.  Compared  to  our  method,  which  took  only  41  min  and  45  sec  for  5,322  images  (0.47  sec 
per  image),  Pritt’s  method  is  more  than  30  times  slower  than  our  BA4S.  Notice  that  the  results  presented 
in  [180]  appear  to  be  tested  over  relatively  flat  terrain  with  negligible  parallax.  Crispell  et  al.  introduced 
an  image  registration  technique  to  deal  with  parallax,  assuming  to  have  a  dense  3D  reconstructed  model 
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Figure  7:  Concept  of  parallax  in  aerial  imagery.  A  scene  and  its  dominant  ground  plane  7r  is  observed  by  n  aerial  cameras.  represents 
the  homography  transformation  between  the  image  plane  of  C\  and  the  plane  7 r.  For  an  on-the-plane  3D  point  such  as  Xi,  its  homographic 
transformations  from  the  camera  image  planes  onto  7r  will  all  merge  together  and  coincide  to  Xi  (the  green  point).  However,  for  an  off-the-plane 
point  such  as  X2,  its  homographic  transformations  will  be  spread  out  (see  the  red  points  .  .  .7r  xj ).  These  red  points  are  spurious  and 

induced  by  the  parallax. 

of  the  scene  [58].  In  [181]  GPS  and  IMU  were  used  to  perform  an  initial  (coarse)  orthorectification 
and  georeferencing  of  each  image  in  an  aerial  video.  Then  a  RANSAC-based  method  was  used  to  find 
optimal  affine  transformations  in  2D  image  space.  A  method  for  registering  and  mosaicking  multi-camera 
images  was  proposed  in  [99].  In  the  proposed  method,  registration  is  achieved  using  control  points  and 
projective  image-to-image  transformations  (using  a  variation  of  RANSAC).  Among  the  corresponding 
control  points  found  in  overlapping  images,  those  that  best  satisfy  the  projective  constraint  are  used  to 
register  the  images.  Their  system  took  about  15  minutes  to  estimate  the  transformation  matrices  for 
registering  just  six  overlapping  images,  each  of  size  4,008  x  2,760,  which  is  considerably  very  slow.  A 
technique  for  optimal  feature  matching  was  recently  proposed  in  [135],  called  locally  linear  transforming 
(LLT),  which  tries  to  tackle  outliers  using  the  consistency  of  local  features  within  a  neighborhood.  A 
local  geometrical  constraint  was  developed  that  can  preserve  local  structures  among  neighboring  feature 
points  which  is  also  robust  to  a  large  number  of  outliers.  It  has  a  relatively  high  complexity  and  also  uses 
exhaustive  iterative  methods;  two  drawbacks  in  the  feature  matching  stages  of  a  SfM  pipeline  which  we 
are  avoiding  in  our  proposed  approach.  A  similar  algorithm,  called  Restricted  Spatial  Order  Constraints 
(RSOC),  was  proposed  in  [131]  to  deal  with  outliers  for  registering  aerial  images.  Both  local  structure 
and  global  information  were  considered  in  RSOC.  It  assumes  that  neighbor  spatial  order  is  preserved  after 
rigid  and  affine  transformation  and  based  on  that  an  affine  invariant  descriptor  was  defined.  However 
such  assumption  for  oblique  aerial  imagery  of  urban  scenes  is  not  held,  due  to  existence  of  high  parallax. 
Lee  et  al.  introduced  a  method  for  image  registration  of  airborne  LiDAR,  hyper- spectral  and  photographic 
imagery  of  wooded  landscapes  [  20].  A  nonparametric  (NP)  image  registration  technique  was  proposed  to 
align  images  obtained  from  multi-sensor  imaging.  The  developed  method  was  mainly  for  the  registration 
of  three  types  of  airborne  remote  sensing  data  sampled  over  wooded  landscapes. 

3.2.2  Georegistration  Technique 

Using  the  previously  described  approach  to  refine  camera  parameters  one  can  globally  register  the  images 
in  geo-reference  system.  Fig.  7  illustrates  a  world  coordinate  system  and  a  ground  plane  n  spanning 
through  its  X  and  Y  axes.  Without  loss  of  generality,  we  consider  to  be  on  7 r.  The  scene  is  observed 
by  n  aerial  cameras  C\,  C2  . . .  Cn.  To  make  the  notations  succinct,  we  will  omit  the  camera  indices 
from  now  on  unless  otherwise  stated.  Let  us  consider  a  projective  camera  C,  with  coordinate  system  J^c, 
observing  the  scene.  Using  (4),  the  image  coordinate  (expressed  in  homogeneous  system)  of  a  general  3D 
point  X  =  [x  y  z]J  from  ,XW  projected  on  the  image  plane  is  obtained: 

x  =  K  [R  X  +  t]  (12) 

where  K  is  the  camera  calibration  matrix  (intrinsic  parameters),  R  and  t  are  respectively  the  rotation 
matrix  and  translation  vector  from  ,XW  to  J^c.  The  Z  coordinate  of  a  3D  point  X  lying  on  n  is  zero 
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yielding: 


x  =  K  ^[ri  r2  r3]  [x  y  0]T  +  fj  (13) 

ri,  r2  and  r3  being  the  first,  second  and  third  columns  of  R,  respectively.  After  simplification  we  have 

x  =  K  [n  r2  t]  (14) 

where  ^x  =  [x  y  1]T  represent  the  2D  homogeneous  coordinates  of  3D  point  X  on  tv.  Indeed,  one  can 
consider  the  term  K  [ri  r2  t]  analogue  to  a  3  x  3  homography  transformation  matrix  which  maps  any 
2D  point  from  tv  onto  the  camera  image  plane  as: 

i  =  cH7r  "x.  (15) 


Likewise,  a  2D  homogeneous  image  point  x  can  project  back  on  tv  as: 

"x  =  CH“1X  =  'Hcx 

where  nIlc  is  equal  to: 

fHc  =  [ri  r2  t]_1  K_1 
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The  columns  of  rotation  matrix  R  =  [ri  r2  rs]  are  orthogonal,  i.e.  ri  x  r2  =  r3,  or 
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Using  (22),  after  simplification  we  obtain: 
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where  v  =  [u  v  f] T  and  A  is  a  scalar  defined  as 


A  =  /  rjt. 


(23) 


(24) 


A  in  (23)  can  be  omitted  since  a  homography  matrix  is  defined  up-to-scale.  As  a  result,  we  obtain  the 
following  analytical  expression  for  the  georegistration  homography  matrix: 
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Figure  8:  Computation  of  a  fundamental  matrix  Fqi  corresponding  to  the  images  of  q- th  and  l- th  cameras  using  their  intrinsic  and  extrinsic 
parameters,  x^g  and  x/^  are  two  corresponding  image  points  (related  to  a  3D  point  X*.).  Due  to  noise  in  camera  parameters  or  feature  point 
correspondences,  the  fundamental  constraint  (x^  Fqi  x^ q  =  0)  may  not  be  held,  yielding  to  an  error  d. 

where  its  elements  hV]  are  as  in  (23)  and  we  note  the  following  simplification  for  h\z  and  ^23- 

/113  —  -  [/ill,  /112,  r  12^2  -  r22^i]  V 

(26) 

/l23  =  —  [/l21,  /l22,  —  7*11^2  +  7*2ltl]  V. 

A  homography  relationship  is  always  induced  by  a  3D  plane  in  the  scene  (in  our  case  the  georegistration 
ground  plane  7 r).  In  the  noise  free  case  (for  the  camera  pose  parameters),  the  back  projection  errors  for 
points  on  the  ground  plane  will  be  zero  -  that  is  the  back  projections  of  the  (2D)  image  points  from 
each  camera,  of  a  3D  point  lying  on  the  scene  plane  7 r,  will  map  to  this  same  3D  point.  However,  back 
projection  of  2D  image  points  associated  with  a  3D  scene  point  that  is  off-the-ground-plane  (not  lying  on 
7r)  will  be  scattered  lying  on  a  contour  that  depends  on  the  height  of  the  object  and  follows  the  trajectory  of 
the  camera  (i.e.  clockwise  example  shown  by  the  red  rays  in  Fig.  7).  This  is  referred  to  as  motion  induced 
parallax.  For  example,  an  on-the-plane  3D  point  such  as  Xi  has  homographies  or  back  projections  from 
each  of  the  cameras  77  x\,  77  x\  . . .  77  X™,  that  coincide  with  the  true  location  Xi.  However,  for  off-the- 
plane  points  such  as  X2  its  homographies  or  back  projections  will  be  spread  out  as  shown  by  the  points 
77  x\,  77  x^  . . .  77  x^.  The  motion  induced  parallax  of  such  tall  structures  will  produce  apparent  motion  in 
the  georegistered  image  sequence  once  the  images  are  stabilized  using  the  direct  homographies.  The  same 
approach  can  be  extended  to  work  with  a  camera  array  made  up  of  multiple  focal  planes  on  the  same  aerial 
platform  or  on  multiple  platforms  in  a  sensor  network. 

3.3  Accuracy  Evaluation  Using  Fundamental  Matrix-based  Euclidean  Epipolar  Error 
(EEE) 

In  aerial  WAMI  reconstructions,  it  is  not  always  practical  to  provide  a  quantitative  evaluation  of  the  results 
due  to  a  lack  of  available  3D  ground- truth  that  is  both  expensive  and  difficult  to  collect  [82].  Generally,  the 
reprojection  error  is  commonly  used  for  evaluating  SfM  results.  However,  in  our  pipeline  the  standard  L2 
reprojection  error  might  not  be  a  merit  metric  to  compare  it  against  other  SfM  methods,  since  outliers  are 
not  explicitly  eliminated  in  our  approach.  Here  all  spurious  scene  points  as  well  as  valid  3D  points  across 
all  cameras  will  contribute  to  the  reprojection  error,  while  our  primary  objective  is  to  assess  the  accuracy 
of  the  recovered  camera  pose.  Therefore  for  comparison,  we  use  a  pixel-based  error  measure,  to  evaluate 
the  SfM  results,  which  uses  the  classical  epipolar  constraint  to  evaluate  the  quality  of  the  refined  camera 
parameters  on  the  image  plane  using  manually  tracked  feature  points.  We  call  this  method  as  Fundamental 
matrix-based  Euclidean  Epipolar  Error,  or  EEE.  For  evaluation  purposes,  we  generate  image-based  ground 
truth  by  manually  tracking  Ng  feature  tracks,  in  each  sequential  or  WAMI  dataset  preferably  over  all 
cameras  or  views;  typically  Ng  =  1 1  to  50.  Given  reference  camera,  q,  then  for  each  possible  camera  pair 
(q,  l ),  their  corresponding  fundamental  transformation  matrix  [95]  can  be  analytically  computed  (without 
using  image  point  correspondence-based  estimation)  using  camera  parameters  (i.e.  metadata).  Consider 
a  pair  of  cameras,  Cq  and  Cu  with  associated  projection  matrices  ¥q  and  P /,  and  centers  Cq  and  Q, 
respectively.  Assume  a  scene  3D  point  X&  is  projected  on  the  image  planes  of  Cq  and  Ci  respectively 
as  Xkq  and  x^.  Given  the  2D  image  point  x^q,  we  can  determine  the  set  of  3D  points  in  the  scene  that 
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map  to  this  point.  Such  a  set  will  constitute  a  ray  in  the  scene  passing  through  Cq.  Two  points  on  this 
ray  are  already  known  -  one  is  the  camera  center  Cq  (notice  that  ~PqCq  —  0  )  and  the  second  point  is 
P  q*kqi  where  P+  denotes  the  pseudo-inverse  of  Pq  (i.e.  Pq  P+  =  I).  Point  P+  lies  on  the  ray  since 
it  projects  to  as  Pg(P+  x^)  =  x/^.  These  two  points,  Cq  and  P+x&g,  are  imaged  by  the  second 
camera  Ci  at  P/Cg  and  P/P+x/^,  respectively.  These  two  imaged  points  in  the  second  view  (camera  Ci) 
generate  the  epipolar  line , 

L ki  =  (P«C9)  x  (P,P+xkq),  (27) 

The  projection  of  the  camera  center  in  the  second  view,  or  point  PiCq,  is  the  epipole  of  Cq  (see  e2  in 
Fig.  8).  The  epipolar  line  in  (27)  can  be  expressed  as 

Lfei  =  (P;Cg)  x  (P,P+)xkq 

=  [e2]x(P,P+)xkq  (28) 

=  Pg/  ^kq 


where  F qi  is  the  fundamental  matrix  which  maps  any  2D  image  points  from  camera  Cq  to  the  image  plane 
of  Ci,  and  defined  by 

Fqi  =  [e2]x(P*P+)  (29) 


Notice  that  if  a  =  [ai,  a2,  a^]T  is  a  3-vector,  then  [a]  x  defines  the  corresponding  3x3  skew -symmetric 
matrix  as  follow  [9  ] : 
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The  fundamental  matrix  in  (28)  can  be  developed  to  be  expressed  by  the  camera  parameters  (metadata). 
Without  loss  of  generality,  assume  the  coordinate  system  of  camera  Cq,  as  the  world  origin.  Also 
assume  that  lHq  and  t.r/  are  respectively  the  rotation  matrix  and  translation  vector  from  -  A(/  to  ->/, 


Pi  =  K?[l3xi|03Xi],  and  P2  =  K;[*Rg|*tg]. 


(31) 


Substituting  P,,  and  P/  in  (29)  respectively  with  Pi  and  P  >  defined  in  (31)  gives  [95], 

F9i  =  [P;C9]x(PiP+) 

=  [K,  hq]x  K,  lKq  K-1 

=  K-T['tg]x  ^K'1  (32) 

=  KrT'R(/[;R(yt(/]x  k-1 
=  KrlKq  KT[Kg!RT(t?]x 


The  rotation  matrix  and  translation  vectors  between  the  two  camera  coordinate  systems,  lHq  and  ltq,  are 
obtained  using  their  rotation  matrices  and  translation  vectors  from  the  metadata  expressed  in  &w\ 


*R9  =  R/  RJ 
ltq  =  ti  -  R;  RJ  tg 


(33) 


From  (33)  and  (32)  we  derive  the  following  analytical  expression  for  the  fundamental  matrix  between  the 
image  planes  of  two  cameras: 


F ql  =  K“T  Rj  RT  KT  [Kq  RT  (ti  -  R i  RJ  t,)]  x 


(34) 


For  the  fc-th  3D  groundtruth  point,  g^,  with  k  —  1 . . .  Ng,  each  projected  into  camera,  l,  its  corresponding 
epipolar  line  induced  by  g kq  from  Cq  is  computed  and  plotted  in  the  reference  frame  l.  The  sum  of  the 
perpendicular  Euclidean  distances  (see  d  in  Fig.  8)  between  each  epipolar  line  and  its  associated  ground- 
truth  point,  averaged  over  all  points,  is  used  as  the  error  measure  between  any  pair  of  cameras: 

^i  =  ^rEklid(Ski,Fqlgkq)  =  EEE  (35) 

7V 9 
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This  error  (referred  to  as  EEE)  is  computed  over  all  possible  pairs  of  cameras  in  the  sequence,  {(g,  l)  \  q,  l  = 
1 ..  n}.  Ideally  eqi  should  be  zero  due  to  the  fundamental  geometric  constraint.  However,  the  triple  prod¬ 
uct  F qi  g kq  ^  0,  in  real  scenarios  due  to  errors  in  either  point  correspondences  or  camera  parameters. 
The  errors  eqi  can  be  treated  as  a  matrix  and  visualized  using  colored  picture  elements  (pels).  In  addition 
to  computing,  eqi ,  between  cameras,  q  and  /,  the  mean  fie  and  standard  deviation  cre  of  the  error  over  all 
cameras  is: 


Me 


n 


sr^Ti  /  v— \n  /  \ 

2-^q=  1  2-^1= 1  €(ll  ’  ae  ~  \  Z2  2^q=l  2^1=1  \€(Il  ~  Me) 


(36) 


3.3.1  Evaluating  Feature  Matches  Without  Image-based  Ground-Truth 

Normally  for  real  data,  specially  large  aerial  images,  it  is  not  feasible  to  have  image-based  ground  truth  for 
evaluating  the  features  and  the  matches  among  them.  Sometimes  it  is  very  useful  to  apply  different  sort 
of  feature  extraction  algorithms  and  different  kind  of  matching  schemes  and  evaluate  their  performance 
relative  to  each  other  for  specific  scenarios.  Limited  ground-truthing  can  be  done  and  we  have  been 
doing  this  routinely  to  evaluate  the  quality  of  the  camera  metadata  corrections.  However,  evaluating  the 
corrected  camera  parameters,  with  hundreds  to  thousands  of  cameras,  without  any  ground-truth  has  not 
been  practical  so  far  (to  the  best  of  our  knowledge).  Alternative  methods  for  evaluating  the  quality  of  the 
metadata  or  comparing  the  results  from  different  algorithms  are  needed  due  to  the  lack  of  sufficient  ground- 
truth  for  a  well  distributed  set  of  features  and  their  matches  or  correspondences  in  large  datasets  like 
WAMI.  Considering  the  importance  of  the  topic,  we  propose  to  implement  and  test  algorithms  that  have 
the  potential  to  alleviate  such  barriers  by  automatically  evaluating  the  quality  of  features/matches  with 
limited  or  no  manual  ground-truth  and  assist  the  process  of  identifying  accurate  point  correspondences. 
The  approach  is  based  on  using  bundle  adjusted  camera  parameters.  The  optimized  parameters  will  be 
used  as  ground- truth  (for  camera  parameters  but  not  features).  For  each  track  of  corresponding  features, 
the  middle  frame  is  considered  as  the  reference.  All  points  of  interest  (features)  from  each  image  in  the 
track  is  projected  onto  the  reference  image  using  epipolar  lines.  The  Euclidean  distance  between  each 
epipolar  line  and  its  corresponding  point  in  the  reference  frame  is  computed  as  an  error  value.  Ideally 
such  a  value  should  be  zero  but  this  is  not  the  case  in  real  scenarios  due  to  various  noise  processes  and 
modeling  errors.  The  mean  and  covariance  of  these  error  values  have  high  precision  and  will  be  considered 
as  a  metric  to  evaluate  the  quality  of  features  and  precision  of  the  matching  algorithm. 


3.3.2  GUI  for  BA,  Metadata  Validation  and  Ground  Truth  Creation 

The  Epipolar  Transformation  (EpiX)  visualization  tool  is  used  to  monitor  the  quality  of  the  raw  and  cor¬ 
rected  metadata  as  this  is  crucial  for  successful  estimation  of  dense  3D  point  cloud  reconstructions.  Ex¬ 
isting  tools  like  VisualSFM  have  been  developed  but  are  customized  for  different  3D  workflows  and  do 
not  include  efficient  ground-truthing  tools  like  EpiX  with  its  optimized  projection  modules.  EpiX  uses 
the  Qt  GUI  software  development  toolkit,  which  is  a  cross-platform  application  framework  for  enabling 
the  EpiX  software  to  be  deployed  on  Microsoft  Windows,  Apple  OS  X  and  Linux  platforms.  EpiX  also 
uses  the  openCV  libraries  for  supporting  matrix  operations  and  low  level  vision  modules  such  as  feature 
operators.  EpiX  application  consists  of  three  main  components  that  are  planned  to  support: 

•  File  loading:  a)  loading  metadata  files,  b)  Loading  and  displaying  image  thumbnails 

•  Structure  from  motion  (SfM):  a)  Feature  extraction  &  tracking,  b)  Loop  closure,  c)  Triangulation  & 
Bundle  Adjustment  preprocessing,  d)  Bundle  Adjustment  optimization,  e)  Alignment  with  metadata 

•  Supporting  Tools:  a)  Visual  assessment  using  epipolar  lines,  b)  Plotting  errors  in  camera  position,  c) 
Calculating  the  EEE  matrix,  d)  Applying  PM  VS  (Patch-based  Multi- view  Stereo)  for  assessment  or 
initialization,  e)  Plotting  trajectories  of  points,  f)  Ground-truth  creation,  g)  Ground-truth  validation 

In  the  ground-truth  creation  module  the  user  is  provided  with  a  convenient  tool  to  quickly  produce  a 
long  tracked  sequence  of  stationary  object  feature  points  (such  as  the  corner  of  a  building  or  roof  struc¬ 
tures).  These  ground-truth  points  can  be  used  to  evaluate  the  improvement  in  metadata  quality  as  a  result 
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of  applying  the  BA  module  and  checking  various  projective  errors  of  the  manually  selected  ground-truth 
points.  The  user  loads  images  and  metadata  using  EpiX  in  the  first  step.  The  user  then  manually  selects 
and  marks  a  feature  point  of  interest  in  two  frames  well  separated  in  time,  that  is  with  a  sufficiently  large 
enough  baseline,  after  which  the  projection  algorithm  in  EpiX  predicts  and  displays  the  expected  posi¬ 
tion  of  the  feature  point  within  all  of  the  other  frames  using  the  epipolar  geometry  constraint  as  shown 
in  Figure  9.  The  ground  truth  validation  module  is  used  to  visually  inspect  the  ground  truth  data  points 
and  make  any  small  corrections  for  creating  a  reproducible  collection  of  feature  points.  Users  are  able  to 
visualize  the  displacement  errors  of  the  marked  feature  points  in  the  data  collection. 

3.4  Voting-based  Reconstruction 
3.4.1  De  Novo  Voting 

Given  an  accurate  camera  array  sensor  model  and  platform  pose  and  position  with  respect  an  earth  centered 
frame  of  reference  we  have  implemented  a  voting  based  algorithm  for  reconstruction.  The  votes  are 
accumulated  along  rays  cast  from  the  camera  center  of  projection  to  the  scene  for  selected  interest  points 
identified  on  the  basis  of  local  feature  structures.  The  voxel  space  geometry  is  shown  in  Figure  10.  The 
blue  circle  shows  the  flight  trajectory,  camera  coordinate  system  and  retinal  plane  is  shown  in  red,  and 
the  voxel  space  contains  one  building  in  the  center  of  the  scene.  Rays  going  from  the  camera  center  of 
projection  to  the  visible  corners  of  buildings  faces  are  shown  in  black,  and  the  intersection  of  all  of  these 
rays  with  the  ground-plane  for  one  complete  circular  orbit  is  shown  as  green  circles.  The  scale  of  the 
building  and  flight  trajectory  is  exaggerated  to  facilitate  visualization. 

Voting  is  the  first  critical  part  of  our  3D  reconstruction  algorithm  and  it  is  distinct  from  other  multi¬ 
view  approaches  that  are  matching  centric.  Our  voting  approach  involves  covariant  feature  detection  in 
each  frame  but  no  explicit  feature  matching  (for  triangulating  points).  Images  are  acquired  from  an  aerial 
platform  flying  in  a  circular  flightpath  at  an  altitude  of  approximately  1500  meters  (Figure  10).  The  radius 
of  the  circular  trajectory  is  about  3000  meters.  For  the  reconstruction,  we  use  a  volumetric  approach.  A 
voxel  box  is  computed  around  the  targeted  area  and  we  select  for  each  view,  distinctive  feature  points  in 
the  image  and  trace  the  corresponding  ray  into  the  scene.  We  use  ray  casting  from  the  camera  center  of 
projection,  through  the  voxel  box,  and  vote  for  voxels  along  the  ray;  so  a  voxel  receives  as  many  votes  as 
rays  that  traverse  it.  Given  a  set  of  images  and  an  accurate  estimate  of  the  camera  exterior  orientations 
as  well  as  intrinsic  camera  parameters  the  voting  approach  is  summarized  in  Algorithm  1.  There  are  a 
number  of  improvements  to  the  basic  voting  process  that  are  required  to  improve  the  quality  of  the  3D 
point  cloud  reconstruction  -  reducing  outlier  clusters  or  false  structures  that  may  be  floating  or  attached  to 
other  objects,  gaps  or  holes  in  buildings  such  as  rooftops  and  walls,  vegetation  and  terrain.  Some  of  these 
improvements  that  will  investigated  are  described  in  the  following  sections. 


Algorithm  1  Voting  Algorithm 

1:  V oxelBox  =  InitializeVoxelBox 
2:  for  Imagelndex  =  1  :  NImage  do 

3:  Feature  Point  List  —  Get  Feature  Point  List  {Imagelndex) 

4:  Camera  Parameters  —  GetCamer  a  Parameters  {Imagelndex) 

5:  for  Feature  Index  —  1  :  N  Features  do 

6:  Ray  =  GetRay{FeaturePointList{FeatureIndex) ,  C amer  a  Parameters) 

7:  [Entry  Point,  Exit  Point]  =  G  et  Ray  AndV oxel  B  ox  Inter  sections  (Ray ,  Voxel  Box) 

8:  V oxelList  m  Bresenham3D  {Entry  Point,  Exit  Point) 

9:  for  Voxellndex  —  1  :  NV oxel  do 

10:  Vox  el  List  {Vox  el  Index)  +  + 

1 1 :  end  for 

12:  end  for 

13:  end  for 
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epipolarlmage 


epipolarlmage 
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(a)  Drawing  epipolar  line  mode  -  frame  t 


(b)  Drawing  epipolar  line  mode  -  frame  t+1 


S3 


(d)  Ground  truth  creation  mode  -  Frame  t+1 


(e)  Frame  t - Frame  t+1 - Frame  t+  50 


Figure  9:  Ground  truth  creation  based  on  marked  building  feature  points  in  two  different  views  50  frames  apart  in  time,  the  intermediate  estimated 
feature  point  determined  semi-automatically,  and  close  up  views  of  the  building  feature  showing  the  accuracy  of  user  localization  (red  points) 
and  algorithm  estimated  location  (green  points). 
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Figure  10:  Representation  of  the  flight  trajectory  (blue  circle),  camera  coordinate  system  and  virtual  plane  (in  red),  The  voxels  box  containing 
one  building  is  in  the  center  of  the  scene.  Rays  going  from  the  camera  center  of  projection  to  the  visible  corners  of  buildings  faces  are  represented 
in  black,  and  the  intersection  of  all  of  these  rays  for  one  complete  circular  flight  with  the  ground  is  displayed  as  green  circles.  The  scale  of  the 
building  and  flight  trajectory  are  not  realistic  on  purpose,  in  order  to  facilitate  the  visualization. 

3.4.2  Iterative  Voting  With  Prior  Models 

The  first  phase  of  bundle  adjustment  optimizes  the  metadata  precision  and  at  the  same  time  also  produces 
a  sparse  point  cloud  by  triangulating  SIFT  or  other  feature  point  matches,  that  could  be  used  as  prior 
information  for  voting.  However,  this  initial  structure  is  not  only  sparse  but  also  noisy  and  incorporating 
noisy  observations  during  the  voting  process  can  degrade  the  reconstruction  results.  One  possible  approach 
would  be  to  compute  for  each  frame  a  prior  depth  map  that  would  be  used  to  reduce  the  length  of  the  rays 
during  voting  as  shown  in  Figure  11.  This  would  filter  some  degree  of  the  noise  and  obtain  a  significant 
speed-up  as  the  Bresenham  ray  casting  is  the  most  expensive  computation  and  its  execution  time  depends 
directly  on  the  ray  length.  Furthermore,  the  process  can  be  repeated  in  loops:  for  each  new  iteration,  the 
denser  and  more  accurate  prior  depth  map  is  used  as  a  prior  constraint,  allowing  votes  to  accumulate  on 
more  and  more  features,  with  shorter  and  shorter  rays.  In  such  an  iterative  process  the  voxel  resolution 
could  be  increased  accordingly  provided  that  there  is  sufficient  memory  resources  to  maintain  the  full 
volume.  The  use  of  geospatial  priors  and  multimodal  geoinformatics  enables  persistent  geoint  fusion  in  a 
consistent  manner. 


Figure  11:  Comparison  between  the  basic  voting  and  voting  with  priors  approaches.  Shortening  the  portion  of  the  rays  that  is  rasterized  will 
speed  up  the  voting  process,  and  focus  the  votes  within  a  smaller  zone  that  should  reduce  the  noise  and  increase  the  foreground  surface  signal. 


3.4.3  Incorporating  Prior  Terrain  Model  Information 

Recovering  an  accurate  terrain  model  is  a  very  challenging  task  in  3D  reconstruction  as  the  terrain  usually 
has  less  distinctive  features  than  manmade  structures.  The  current  algorithm  does  not  incorporate  any  prior 
information  about  the  terrain  and  computes  the  ground  as  the  lowest  horizontal  plane  of  the  bounding  box. 
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The  bounding  box  is  picked  carefully  so  that  its  minimum  altitude  is  set  to  the  estimated  lowest  ground 
level  within  the  reconstructed  geographical  region.  However,  this  approximation  can  be  a  source  of  signif¬ 
icant  error  on  some  datasets  since  the  ground  elevation  is  obviously  non-planar  within  the  reconstructed 
region.  Surface  elevation  information  based  on  geological  factors  that  enable  patch  similarity  to  be  used 
to  relate  semblance  in  different  parts  of  the  scene  within  the  field  of  view  which  will  enable  improved 
terrain  reconstruction.  Figure  12  shows  a  summary  of  several  possible  errors.  Some  terrain  information  is 
available  through  the  NASA  WorldWind  toolkit  as  well  as  other  sources  such  as  USGS  and  NGA  and  will 
be  incorporated  as  prior  information  to  improve  the  voting  process  and  the  final  result.  Situation  specific 
priors  from  previous  reconnaissance  of  the  region  can  also  be  used.  Not  only  will  the  reconstruction  of  the 
ground  elevations  be  more  accurate  than  current  results,  but  it  would  also  improve  several  other  parts  of 
the  pipeline  such  as  voting,  vote  collapsing  or  extrusion,  computation  of  prior  depth  maps  and  automatic 
determination  of  bounding  box  parameters.  The  alignment  of  height  fields  from  different  sources  needs  to 
be  precise  to  enable  fusion  of  these  different  sources  of  3D  information  and  will  likely  require  matching 
tie  points  and  ground  control  points. 


Voxel  Box 


(a)  Real  world  structures  (b)  Expected  inaccurate  reconstruction 

Figure  12:  Depending  on  whether  the  bounding  box  ground  plane  is  too  high  or  too  low,  the  reconstruction  algorithm  might  miss  some  parts  of 
the  structure,  or  the  vote  collapsing  might  result  in  incorrectly  reconstructed  supporting  structures. 


3.4.4  Coarse-to-fine  Vote  Spaces 

It  may  be  beneficial  to  use  a  multiscale  coarse-to-fine  voting  approach  to  support  higher  resolution  voxel 
densities.  A  coarse  set  of  voxels,  with  fewer  cameras,  can  be  processed  very  quickly  to  yield  rough 
estimates  of  the  3D  structure.  The  size  of  the  voxels  and  number  of  cameras  can  be  iteratively  refined 
in  a  coarse-to-fine  approach  to  increase  the  resolution  of  the  voxel  space  and  improve  the  details  in  the 
point  cloud.  Algorithms  and  spatially  adaptive  data  structures,  like  octrees,  for  supporting  a  coarse  point 
cloud  as  an  initialization  to  refine  to  the  next  finer  voxel  density  level  need  to  be  developed  and  made 
computationally  efficient  as  well  as  scalable  using  out-of-core  methods. 

3.4.5  Multiview  Incremental  Vote  Accumulation  and  Removal 

Voting  all  along  a  ray,  from  its  entry  to  its  exit  across  the  voxel  grid,  leads  to  an  accumulation  of  votes 
throughout  the  vote  volume  and  contributes  to  a  significant  amount  of  noise  in  the  final  vote  volume 
impacting  downstream  processes  surface  detection  and  extraction.  This  vote  noise  increases  for  each  new 
view.  One  potential  approach  to  reduce  this  vote  noise  would  be  to  remove  some  of  the  votes  that  we  can 
readily  classify  as  noise  early  in  the  voting  process  -  that  is  subtracting  votes  from  one  view  (the  earliest  in 
a  group),  while  at  the  same  time  we  are  adding  votes  from  another  new  camera  (the  most  recent  view  in  a 
group).  This  leads  to  an  accumulate-remove  (or  add-subtract)  type  of  voting  algorithm  to  ensure  that  each 
voxel  receives  only  one  vote  (or  at  most  a  few)  per  frame.  We  propose  investigating  a  modified  voting 
algorithm  that  enforces  an  implicit  temporal  consistency  of  votes  and  filters  the  noise  from  older  frames 
or  views  while  at  the  same  time  the  algorithm  adds  votes  from  new  frames  or  views.  The  alternating 
phases  of  vote  accumulation  and  vote  removal  are  described  in  Algorithm  4.  Furthermore,  this  method 
also  enforces  that  each  voxel  can  receive  at  most  one  vote  per  frame,  regardless  of  the  voxel  resolution 
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and  feature  density  as  described  in  Algorithms  2  and  3.  Current  testing  using  adjacent  views  (one  second 
sampling)  will  be  extended  to  use  well  separated  views  such  as  four  second  sampling. 

Algorithm  2  Voting_Unique  (Currentlmagelndex) 

1 :  for  All  V oxel  do 

2:  Voxel. CurrentVote  —  false 

3:  end  for 

4:  Feature  Point  List  =  GetFeaturePointList(CurrentImageIndex) 

5:  Camera  Parameters  —  GetCamer  a  Parameters  {Currentlmagelndex) 

6:  for  each  FeaturePoint  in  FeaturePoint List  do 
7:  Ray  —  Get  Ray  (Feature  Point,  CameraParameters) 

8:  [Entry  Point,  Exit  Point]  =  GetRay  AndV  ox  el  Box  Inter  sections  (Ray ,  Vox  el  Box) 

9:  RayV  oxelList  —  Bresenham3D  (Entry  Point,  Exit  Point) 

10:  for  each  Voxel  in  RayV  oxelList  do 

11:  V  oxel .  CurrentV  ote  =  true 

12:  end  for 

13:  end  for 

14:  forAl l  Voxel  do 

15:  if  Voxel. CurrentV  ote  ——  true  then 

16:  Voxel.VoteCount  +  + 

17:  end  if 

18:  end  for 


Algorithm  3  Voting_Removal  (Currentlmagelndex) 

1 :  for  All  V oxel  do 

2:  Vox  el. CurrentV  ote  —  false 

3:  end  for 

4:  Feature  Point  List  =  Get  FeaturePoint  List  (Currentlmagelndex) 

5:  CameraParameters  —  GetC ameraP arameter s(CurrentImageIndex) 

6:  for  each  FeaturePoint  in  FeaturePoint  List  do 
7:  Ray  —  Get  Ray  (Feature  Point,  CameraParameters) 

8:  [Entry  Point,  Exit  Point]  =  GetRay  AndV  ox  el  Box  Inter  sections  (Ray ,  Vox  el  Box) 

9:  RayV  oxel  List  —  Br  esenhamSD  (Entry  Point,  ExitPoint) 

10:  for  each  Voxel  in  RayV  oxelList  do 

11:  if  Voxel.VoteCount  <  RepeatabilityThreshold  then 

12:  Vox  el. CurrentV  ote  =  true 

13:  end  if 

14:  end  for 

15:  end  for 

16:  for  All  Voxel  do 

17:  if  Voxel. CurrentV  ote  ——  true  then 

18:  V  oxel .  V  oteCount - 

19:  end  if 

20:  end  for 


3.4.6  Automatic  Point  Cloud  Extraction 

Our  automatic  point  cloud  extraction  requires  as  a  first  step  computation  of  the  vote  histogram  so  that 
we  can  transform  the  vote  to  a  normalized  value  for  each  voxel.  The  first  part  is  histogram  computation 
which  is  straightforward;  we  simply  compute  the  frequency  (pdf)  and  cumulative  (cdf)  histograms  of 
values  for  both  votes  and  the  associated  gradient  magnitude  estimates  which  provide  information  about 
scene  structure.  The  size  of  the  bins  is  set  to  one.  An  example  of  the  histograms  for  the  Albuquerque 
dataset  are  shown  in  Figure  13,  14,  and  15.  These  include  the  vote  and  vote  gradient  histograms  on  linear 
and  log  scales,  the  vote  and  vote  gradient  histograms  after  vote  extrusion  to  improve  building  structures, 
vote  gradient  histogram  after  vote  extrusion  and  Gaussian  smoothing  to  filter  out  noisy  votes  and  a  2D 
histogram  of  votes  versus  vote  gradients.  These  histograms  demonstrate  the  difficulty  in  trying  to  select 
a  single  manual  or  automatic  vote  threshold  to  separate  foreground  from  background  since  there  is  no 
bimodal  structure  between  the  two  classes  as  seen  in  Figure  19. 
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Algorithm  4  Alternating  Vote  Accumulation  and  Vote  Removal 

1:  for  Framelndex=  1  to  Repeat abilityT hr eshold  do 
2:  Voting_Uniqu  e(Framelndex) 

3:  end  for 

4:  for  Framelndex-Repeat  abilityT  hr  eshold  +  1  to  N  _Frames  do 
5:  \oting_RemoY2L\(FrameIndex  —  RepeatabilityT  hr  eshold) 

6:  Voting_Uniqu  e(Framelndex) 

7:  end  for 


Fusing  Information  for  Automatic  Vote  Thresholding:  The  number  of  votes  and  the  range  of  the 
gradient  magnitudes  depend  on  a  number  of  parameters  including  number  of  frames  used  for  voting  and 
the  grid  resolution.  In  case  of  automatic  thresholding  it  becomes  necessary  to  normalize  the  information 
contained  in  the  voxel  box.  Therefore,  we  use  the  histograms  to  define  a  low  threshold  below  which  the 
voxel  score  is  set  to  zero  and  a  high  threshold  above  which  the  score  is  set  to  one  and  use  a  sigmoid  or 
logistic  function  (see  Figure  17)  to  assign  a  normalized  score  between  0  and  1  to  any  value  between  the 
low  and  the  high  thresholds.  The  low  threshold  is  chosen  as  the  mean  value,  and  the  high  threshold  as  the 
value  such  as  only  2  percent  of  the  voxels  are  above  it.  Two  independant  scores  are  computed,  one  for  the 
number  of  votes  and  one  for  the  gradient  magnitude  and  for  each  voxel  the  two  scores  are  then  multiplied 
together  to  obtain  a  final  score  normalized  between  0  and  1.  Usually  we  retain  only  those  voxels  in  the 
final  point  cloud  with  a  value  above  0.5  to  0.8. 

3.5  Deriving  a  Depth  Map  Probability  Distribution  from  the  Vote  Volume 

Instead  of  applying  a  global  threshold,  the  vote  volume  can  be  converted  into  a  per  voxel  probability  of 
occupancy  likelihood  map,  from  which  we  can  then  derive  a  depth  distribution  along  each  ray.  The  order 
in  which  the  voxels  are  encountered  is  important  and  the  first  voxel  with  very  high  occupancy  probability 
should  correspond  to  the  maximum  of  likelihood  for  the  depth  value  or  location  of  the  surface,  even  if 
there  exist  peaks  at  voxels  with  higher  vote  value  further  along  the  ray.  We  can  formalize  this  constraint 
as  a  voxel  is  visible  from  a  position  if  and  only  if  all  the  voxels  in  between  are  not  occupied.  Therefore, 
given  the  probability  of  occupancy  for  each  voxel  in  between,  we  can  compute  the  probability  of  visibility 
as  the  product  of  all  preceding  voxel  probabilities  of  non-occupancy  as  shown  in  Equation  37.  Then,  the 
current  voxel  corresponds  to  the  feature  observed  in  the  image  if  and  only  if  that  voxel  is  visible,  and 
occupied.  We  consider  in  that  case  that  the  depth  of  the  ray  is  the  distance  from  the  camera  center  to  the 
current  voxel.  We  can  simplify  without  loss  of  generality,  using  Depth  =  i  where  i  is  the  voxel  index, 
varying  from  1  which  is  the  closest  to  the  camera  to  N  which  is  the  furthest  away  as  shown  in  Equation 
38.  Before  all  of  this,  the  probability  of  occupancy  is  computed  by  normalizing  the  vote  volume,  using 
a  sigmoid  as  shown  in  Figure  17  and  the  two  boundaries  corresponding  to  a  low  threshold,  taken  as  the 
average  number  of  votes,  below  which  the  probabiility  equals  0  and  a  high  threshold  that  corresponds  to 
the  top  2  percent  of  the  number  of  votes,  above  which  the  probability  is  set  to  1 . 


Pvisibility{t)  —  n<>-  POccupancy  ( j  )  ) 
j<i 

P {Depth  =  l)  =  Py isibilityit)  *  POccupancyiP) 


(37) 

(38) 


3.6  Combining  Depth  with  Appearance  Consistency 

It  is  possible  to  make  the  model  more  robust  by  adding  appearance  information.  Each  voxel  can  be 
projected  onto  several  images,  and  a  matching  score  can  then  be  computed  (using  either  cross  correlation 
or  a  descriptor  such  as  ORB  or  Daisy).  If  the  appearance  is  not  consistent  between  different  views,  it 
should  decrease  the  probability  that  this  voxel  is  occupied.  However,  we  do  not  want  the  consistency 
information  to  dominate  the  previously  computed  depth  probability.  So  the  appearance  matching  score  is 
normalized  in  the  range  a  G  [0, 1].  A  value  of  zero  corresponds  to  a  total  inconsistency  that  should  raise 
the  probability  of  non-occupancy  or  transparency  to  one.  A  high  value  for  consistency  on  the  other  hand 
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(a)  Histogram  of  votes  (vertical  axis  linear  scale)  (b)  Histogram  of  vote  gradient  magnitude  (linear  scale) 


Albuquerque:  Log  10  Votes  Histograms  Albuquerque:  Log  10  Gradient  Magnitude  Histograms 


(c)  Histogram  of  votes  (vertical  axis  log  scale)  (d)  Histogram  of  vote  gradient  magnitude  (log  scale) 


(e)  Histogram  after  vote  extrusion/collapsing  (linear  scale)  (f)  Histogram  after  vote  extrusion/collapsing  (log  scale) 
Figure  13:  Example  of  vote  space  and  vote  gradient  magnitude  histograms  for  the  Albuquerque  WAMI  dataset. 

should  decrease  the  probability  of  non-occupancy  or  transparency;  but  the  latter  should  still  depend  on  the 
voting  information.  We  therefore  propose  to  use  a  as  an  exponent  using  the  following  equations: 

Py isibilityiS)  =  J_(l  —  P()ccupancy(j ))  J  (39) 

j<i 

P (Depth  l)  PyisibilityiS)  *  [1  (1  POccupancy( 'O)  *]  (40) 
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(a)  Vote  gradient  mag  after  Gaussian  smoothing  (linear  scale)  (b)  Vote  gradient  mag  after  Gaussian  smoothing  (log  scale) 

Figure  14:  Gaussian  smoothing  the  vote  space  then  computing  gradient  magnitude  histograms  for  the  Albuquerque  WAMI  dataset  filters  out  a 
lot  of  noisy  votes. 


Figure  15:  2-D  histograms  showing  the  joint  relationship  between  the  vote  space  and  the  vote  gradient  magnitude  for  the  Albuquerque  WAMI 
dataset  before  and  after  Gaussian  smoothing  filters  out  noisy  votes. 


Histogram  —  Background  and  Foreground. 
Number  of  voxels  in  Background  =  1 948361 83 
Number  of  voxels  in  Foreground  =  6566018 


Histogram  —  Background  and  Foreground 
Number  of  voxels  in  Background  -  194836183 
Number  of  voxels  in  Foreground  -  656601 8 


Figure  16:  Example  of  the  background  vote  class  shown  in  blue  and  foreground  surface  reconstruction  class  shown  in  red  before  and  after  vote 
extrusion  operator  has  been  applied.  The  total  overlap  of  the  two  classes  indicates  the  difficulty  in  identifying  an  appropriate  threshold  to  separate 
the  two  classes. 


Equation  40  can  be  simplified  to: 


P (Depth  —  i )  —  Pvisibility(i)  Pvisibilityi}  H-  1) 


(41) 
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Sigmoid  function,  Lowth=10,  High  th=20 


Figure  17:  Example  of  a  shifted  sigmoid  or  logistic  function  that  is  used  to  transform  vote  values  into  a  surface  probability  map.  The  low 
threshold  is  10  so  that  any  vote  values  below  10  is  assigned  the  probability  0,  any  vote  value  above  20  is  assigned  a  probability  of  1  with  a 
continuous  transition  between  the  two  thresholds. 

3.7  Visualization  of  Point  Clouds  and  Meshes 
3.7.1  Point  Cloud  Coloring 

Once  the  regularized  point  cloud  geometry  has  been  estimated  with  improved  surface  details  and  holes 
in  surfaces  removed,  we  need  to  color  voxels  in  the  point  cloud  for  rendering  visually  accurate  surface 
appearance  estimates  that  can  be  used  for  visualization  and  other  applications  like  improving  tracking. 
Several  algorithms  were  investigated  since  some  of  the  initial  coloring  algorithms  were  extremely  slow 
taking  on  the  order  of  several  days.  Consistent  colors  depend  on  an  accurate  estimate  of  surfaces,  as¬ 
sociated  surface  normals  and  material  properties.  Algorithm  5  selects  the  voxel  color  from  the  view  that 
minimizes  the  angle  between  the  ray  and  the  local  surface  normal.  Algorithm  6  strives  for  view  consistency 
and  uses  as  far  as  possible  voxel  colors  from  the  same  view.  A  fast  coloring  algorithm  using  a  z-buffer 
type  occlusion  check  is  being  designed  to  further  improve  the  estimation  of  surface  colors  (Algorithm  7). 

Algorithm  5  was  the  very  first  coloring  method  that  was  implemented  supporting  different  parameters. 
All  the  images  first  need  to  be  loaded  into  memory  which  is  why  we  could  only  use  it  with  the  Brown 
University  test  datasets.  The  Transparent  Sky  images  were  too  large  to  fit  into  memory.  The  input  must  be 
a  point  cloud  with  the  calculated  normal  for  each  point.  During  Summer  2013  we  were  using  Meshlab  to 
estimate  the  normals.  The  views  that  minimize  the  dot  product  between  the  ray  and  the  normal  are  chosen 
for  coloring.  If  NSelectedView  =  1,  then  only  one  view  is  picked  per  point,  which  is  what  we  tried  the 
first  time.  This  causes  some  obvious  discontinuities  in  the  final  results,  where  a  change  of  view  occurs. 
If  N SelectedV iew  >  1,  then  the  color  assigned  to  each  point  is  somehow  summarizing  the  colors  read 
from  several  images.  Hence  the  Statistic  function  below  on  line  14  can  be  the  mean,  the  median  or  the 
mode  for  example.  The  experiments  we  did  used  the  mean,  and  if  we  were  seeing  fewer  discontinuities, 
the  rendering  had  a  blurry  effect.  Importantly,  this  method  does  not  account  for  occlusion. 


Algorithm  5  Coloring  Method  1 :  Select  the  view  minimizing  the  angle  between  the  ray  and  surface 

1:  for  Point  Index  —  1  :  N  Points  do 
2:  for  View  Index  =  1  :  NView  do 

3:  Ray-PointCoordinate{Point  Index)  —  CameraPosition{V  iew  Index) 

4:  N=N  or  mal  (Point  Index) 

5:  D  ot  P  roduct  Array  (View  I  ndex)=Dot_Product{N,  Ray) 

6:  end  for 

7:  [SortedArray  S  ortedV  iew  I  ndex]=Sort{Dot  Product  Array ,  increase) 

8:  BestV  iew  List=S  ortedV  iew  Index{l  :  NSelectedView) 

9:  for  View  Index  =  1  :  NSelectedView  do 

10:  SelectedV  iew  Index -BestV  iew  List  {View  Index) 

11:  [Px  Py]=Back  Project  {Point  Index,  SelectedV  iew  Index) 

12:  Color  Array  {View  I  ndex)= I  mag  e{S  elect  edV  iew  Index,  Px ,  Py) 

13:  end  for 

14:  C  olor  {PointIndex)=Statistic{C  olor  Array) 

15:  end  for 


Algorithm  6  is  the  method  we  have  been  using  since  the  first  results  on  Albuquerque  (Fall  2013)  and 
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that  we  are  continuing  to  use  currently.  The  idea  is  to  color  as  many  point  as  possible  using  image  colors 
from  the  same  view.  Images  are  loaded  one  by  one,  and  all  points  are  projected  on  it.  If  the  occlusion 
test  is  negative  and  that  the  projection  is  within  the  image  frame,  the  pixel  color  is  assigned  to  the  point, 
which  is  marked  as  colored  and  will  not  be  subsequently  updated  again.  This  method  gives  an  impressive 
rendering  if  the  point  cloud  is  seen  from  positions  close  to  the  first  image’s  position.  However,  the  point 
cloud  color  quality  significantly  decreases  as  we  get  further  from  the  initial  position  especially  with  low 
sun-angles.  An  additional  drawback  of  his  method  is  that  it  has  relatively  slow  performance  since  the 
check  for  occlusions  requires  the  use  of  discrete  voxel-based  ray-casting. 


Algorithm  6  Coloring  Method  2:  Use  as  much  as  possible  the  same  image  for  neighboring  pixels 

1:  ImageSubset  —  ExctractlmageSubsetQ 
2:  for  Imagelndex  =  1  :  N ImagesInSubset  do 
3:  for  Point  Index  =  1  :  N  Points  do 

4:  if  I sCol or ed(Point Index)  ==  false  then 

5:  V oxelList  =  Bresenham3D  (Cur  rent  Point,  CameraC enter,  V  ox  el  Box) 

6:  if  AllV  ox  els  Are  Empty  (Vox  el  List)  ——true  then 

7:  ImageProjection  =  GetImageProjection(ImageIndex,  Pointlndex) 

8:  Col  or  (Point  Index)  =  ImageProjection. Color 

9:  I sC ol  or  ed(Point  Index)  —  true 

10:  end  if 

1 1 :  end  if 

12:  end  for 

13:  end  for 


Algorithm  7  should  produce  a  result  very  close  to  the  results  given  by  Algorithm  6,  but  is  a  lot  faster 
as  it  does  not  need  to  call  the  (Bresenham)  ray-casting  function  to  enforce  the  occlusion  constraint.  The 
difficulty  is  to  correctly  fix  the  depth  map  resolution.  Points  in  the  point  cloud  do  not  have  any  radius  and 
their  dimension  is  therefore  regulated  by  the  depth  map  parameters.  If  this  resolution  is  too  high,  the  depth 
map  will  be  sparse,  and  the  occlusion  check  inaccurate.  If  the  depth  map  resolution  is  too  low,  then  there 
will  be  more  discarded  points  for  occlusion  than  there  should  have  been.  This  method  can  also  include  a 
cost  minimization  component.  Two  types  of  experiments  will  be  done,  one  where  the  function  cost  is  the 
depth,  and  another  where  it  will  be  the  gradient  magnitude  of  the  depth. 

3.7.2  Converting  Point  Clouds  to  Mesh  Representation 

Once  we  have  a  sparse  or  dense  point  cloud,  we  can  construct  a  triangulated  mesh  from  the  point  cloud  to 
extract  high  quality  surfaces  in  the  aerial  scene  that  can  be  used  to  improve  the  quality  of  the  reconstruction 
and  segment  building  structures.  Similar  to  the  challenges  we  have  in  tracking  for  large  scale  aerial 
imagery,  accurate  modeling  is  a  difficult  task  due  to  severe  occlusions,  highly  reflective  surfaces,  varying 
illumination  conditions,  registration  errors  and  sensor  noise.  On  the  other  hand,  the  modeling  performance 
is  highly  dependent  on  the  accuracy  of  the  generated  point  clouds  and  the  presence  of  outliers.  There  are 
several  volumetric  methods  which  could  be  used  for  surface  extraction  that  can  scale  up  to  the  large  grid 
size  of  urban  scenes.  Marching  cube  is  one  of  the  well-known  surface  meshing  algorithms  that  is  based 
on  polygonizing  the  grid  and  is  highly  parallelizable  which  makes  it  a  good  candidate  to  convert  point 
clouds  to  meshes  in  real-time.  The  space  complexity  of  triangle  vertices  and  faces  lists  include:  (i)  3 
x  sizeof(float)  for  xyz,  (ii)  3  x  sizeof(int)  for  RGB,  (iii)  3  x  sizeof(float)  for  normals,  and  (iv)  3  x 
sizeof(int)  for  triangle  faces.  Storing  all  of  this  information  requires  an  efficient  data  structure  for  the 
large  scale  volumes  of  our  reconstructed  urban  scenes.  The  data  structure  must  also  have  the  property 
of  efficient  dynamic  updates  in  order  to  insert  new  (unique)  triangle  vertices.  Using  a  hash-table  does 
not  appear  to  be  scalable  in  terms  of  fast  access  due  to  the  large  number  of  vertices.  We  have  tried  three 
different  approaches  using  hash  tables  including  sorted  and  unsorted  maps,  which  are  C++  templated  data 
structures,  and  also  a  separate  hash  table  to  map  the  unique  grid  edges  and  its  corresponding  cut-point 
which  technically  represents  a  triangle  vertex.  In  the  proposed  new  approach  we  use  a  self-adjusting  splay 
tree  with  the  assumption  that  the  most  recently  accessed  elements  will  most  probably  be  accessed  again 
making  the  splay  tree  an  appropriate  data  structure  to  consider.  Therefore  in  the  process  of  polygonizing 
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Algorithm  7  Coloring  Method  3:  Fast  Coloring  using  a  Z-Buffer  type  occlusion  check 

1:  ImageSubset  =  ExctractlmageSubsetQ 

2:  DepthMap=assign  <  float  >  (DepthMapWidth,  Depth  Map  Height) 

3:  IndexMap-assign  <  int  >  (DepthMapWidth,  Depth  Map  Height) 

4:  Cost=assign  <  float  >  (N Points) 

5:  Cost.fill(INF) 

6:  for  Imagelndex  =  1  :  N ImagesInSubset  do 
7:  DepthMap.fill(INF) 

8:  IndexMap.fill(— 1) 

9:  Currentlmage  m  Loadlmage(Imagelndex) 

10:  for  Point  Index  =  1  :  N  Points  do 

1 1 :  [px  py  Depth]=Back  Project  (Point  Index ,  Imagelndex) 

12:  [x  y]=Image2DepthMap(px,py) 

13:  if  Depth  <  DepthMap(x ,  y)  then 

14:  DepthMap(x ,  y)  =  Depth 

15:  IndexMap(x,y )  =  Pointlndex 

16:  end  if 

17:  end  for 

18:  fora:  =  1  :  DepthMapW idth  do 

19:  for  y  =  1  :  DepthMapH eight  do 

20:  if  IndexMap(x,  y)\  =  —  1  then 

21:  CurrentPointIndex=IndexMap(x ,  y) 

22:  CurrentC  ost=C  ostFunction(x ,  ?/) 

23:  [px  py]=DepthMap2Image(x,y) 

24:  if  CurrentCost  <  Cost  (Cur  rent  Point  Index)  then 

25:  Cost  (Cur  rent  Pointlndex)  —  CurrentCost 

26:  Col  or  (Current  Point  Index)  —  Currentlmage(x,y).  Color 

27:  end  if 

28:  end  if 

29:  end  for 

30:  end  for 

31:  end  for 


the  grid  cells  the  most  recent  inserted  vertices  will  be  kept  in  the  first  levels  of  the  tree  and  will  boost 
the  vertex  search  performance  if  it  has  been  already  added  before.  Mesh  simplification  and  thickening, 
hole  closing  and  texturing  are  the  other  issues  that  need  to  be  addressed  after  reconstructing  the  building 
facades.  One  solution  to  fill  the  holes  is  to  merge  adjacent  grid  cells.  In  this  way  we  may  lose  some 
structure  information  (like  small  windows)  during  merging.  This  can  be  avoided  if  we  add  some  additional 
contextual  feature  and  structure  information  as  constraints  during  the  merge  process.  An  example  mesh 
reconstructed  from  a  portion  of  the  Los  Angeles  point  cloud  is  shown  in  Figure  18. 


Figure  18:  Triangular  meshing  of  the  reconstructed  3D  point  cloud  for  a  small  portion  of  the  Lost  Angeles  dataset. 
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3.7.3  Out-of-core  Spatial  Data  Structures  for  Scalability  to  Large  Point  Clouds 

Extremely  large  point  clouds  will  usually  exceed  the  primary  memory  of  most  computers,  especially  lap¬ 
tops  and  mobile  devices.  In  order  to  process  and  display  such  massive  point  clouds  an  out-of-core  strategy 
is  needed.  The  proposed  approaches  serialize  the  in-memory  streaming  octree  to  secondary  storage.  One 
method  for  serializing  the  streaming  octree  to  disk  leverages  the  hierarchical  nature  of  the  file  system. 
Each  node  in  the  octree  can  be  represented  by  a  directory  in  the  filesystem.  Each  child  node  (sub-octree) 
is  a  subdirectory  of  the  parent.  Additionally,  each  directory  would  contain  a  single  file  with  the  actual 
point  cloud  data. 

Top-most  node  of  a  streaming  octree  showing  child  nodes  as  subdirectories  and  the  data  file  containing 
the  coarsest  sampled  view  of  the  entire  point  cloud.  The  figure  shows  one  level  of  a  serialized  streaming 
octree.  The  data  file  contains  a  fixed  number  of  points  from  the  point  cloud  representing  a  sampled  view  of 
the  current  octant.  Each  subdirectory  represents  a  nested  octree  containing  possibly  more  children  but  also 
another  collection  of  points  that  refine  the  parent.  If  any  of  the  eight  sub-octrees  are  empty,  then  no  subdi¬ 
rectory  will  exists;  this  is  the  serialized  equivalent  of  a  null  child  from  the  in-memory  octree.  Traversing 
this  serialized  octree  is  the  same  as  traversing  the  in-memory  version.  Here  pointers  to  nested  octrees  are 
just  directories.  The  data  file  itself  contains  the  point  positions,  colors,  normal,  votes,  histograms,  etc.  The 
data  files  could  be  compressed  to  reduce  storage  and  improve  read  performance.  A  single  metadata  file 
for  the  entire  octree  (not  shown)  is  all  that  is  needed  in  addition  to  the  file  hierarchy.  This  metadata  file 
should  contain  the  bounding  box  of  the  entire  point  cloud,  which  point  attributes  are  contained  in  the  data 
files  and  their  order,  and  optional  compression  type. 

Unlike  the  above  approach  which  requires  many  directories  and  files,  a  second  method  for  serializing 
the  streaming  octree  produces  a  single  file.  This  approach  represents  pointers  to  nodes  as  offsets  in  the  file. 
Following  metadata  describing  the  point  data  (attributes,  compression,  and  maximum  extents),  an  offset 
to  the  top  most  node  is  provided.  This  offset  is  absolute  indicating  how  many  bytes  from  the  beginning  of 
the  file. 

Each  node  contained  in  the  file  contains  eight  file  offsets  indicating  where  in  the  file  those  child  nodes 
reside  on  disk.  Following  the  offsets  to  child  nodes  is  the  actual  point  cloud  data  (possibly  compressed) 
for  the  current  node.  For  any  empty  octants,  the  file  offset  will  be  the  special  value  zero;  because  of  the 
header  and  metatdata  no  nodes  can  exist  at  the  very  beginning  of  the  file.  This  zero  offset  parallels  a  null 
pointer  of  the  in-memory  octree.  This  approach  is  nearly  identical  to  the  in-memory  data  structure  except 
instead  of  memory  pointers  to  sub-octrees,  the  pointers  are  represented  as  file  offsets. 

An  important  consideration  for  both  approaches  to  serializing  the  data  to  disk  is  that  none  of  the 
octants  are  directly  available  for  reading;  rather  finding  the  data  to  read  requires  traversing  several  steps 
either  through  the  file  system  or  several  seeks  in  the  single  file  approach.  While  this  might  be  considered  a 
disadvantage  this  actually  closely  mirrors  the  process  used  when  displaying  the  point  cloud  data  in  Stratus. 
No  single  node  contains  all  the  data  for  a  given  octant.  Each  parent  contains  a  coarse  view  and  the  children 
nodes  refine  that  data  of  the  parent.  To  draw  all  points  contained  in  an  octant,  all  the  points  from  the  root 
to  the  leaf  node  is  required.  This  hierarchical  refinement  is  at  the  core  of  efficient  culling  and  continuous 
level-of-detail  display  of  data  in  Stratus. 

3.7.4  Point  Cloud  Refinement 

In  order  to  improve  the  understanding  of  3D  structures  in  the  scene  extraction  of  building  footprints,  the 
ground  plane  and  identifying  the  connected  components  related  to  buildings  will  be  necessary.  We  use  a 
plane  fitting  approach  with  RANSAC  to  filter  out  noise  in  identifying  the  largest  fitting  plane  to  identify 
the  ground  plane  and  then  subsequently  identifying  the  various  separate  buildings  in  the  scene.  Some 
preliminary  results  in  this  direction  are  shown  in  Figure  19. 

3.8  Moving  Vehicle  Detection  using  Semantic  Depth  Map  Fusion 

Automatic  moving  object  detection  and  segmentation  is  one  of  the  fundamental  low-level  tasks  for  many 
of  the  urban  traffic  surveillance  systems  including  traffic  monitoring,  activity  and  behavior  recognition 
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Figure  19:  Initial  segmentation  of  building  structures  using  RANSAC-based  plane  fitting  for  a  portion  of  Albuquerque  (left)  and  Los  Angeles 
(right). 

and  tracking  applications.  We  developed  an  automatic  moving  vehicle  detection  system  for  wide  area 
aerial  imagery  based  on  semantic  fusion  of  flux  trace  and  building  mask  information.  Flux  trace  provides 
spatio-temporal  information  of  moving  edges  including  false  detected  moving  buildings  caused  by  paral¬ 
lax  effects.  We  deployed  building  depth  information  to  filter  out  motion  detection  due  to  parallax  effect 
from  actual  moving  vehicles  on  the  ground.  The  coarse  building  depth  map  were  guided  towards  the  actual 
building  boundaries  using  a  level-set  geodesic  active  contour  framework.  An  average  precision  of  90% 
and  recall  of  80%  have  been  reported  for  200,  2k  x  2k  cropped  region  of  interest  from  Albuquerque  urban 
imagery. 

3.8.1  Introduction 

Automated  video-based  supervision  systems  known  as  video  surveillance  systems  are  essential  for  moni¬ 
toring  of  indoor  or  outdoor  scenes  and  activities.  An  efficient  management  of  transport  networks,  traffic- 
signal  monitoring,  protection  of  infrastructures,  border  control,  health  care  monitoring  applications  and 
many  others  [177,  125].  Automatic  moving  object  detection  and  segmentation  is  one  of  the  fundamental 
low-level  tasks  for  many  of  urban  traffic  surveillance  system  applications  including  traffic  monitoring[128], 
activity  and  behavior  recognition  [193,  145,  76,  51],  change  detection  [200],  classification  [144,  143,  194, 
74,  192]  and  in  particular  detection  and  tracking  [213,  162,  219,  36,  177,  111,  164,  179,  190,  101,  55,  150, 
165,  176]. 

Detection  and  segmentation  of  moving  objects  from  stationary  background  is  challenging/complicated 
due  to  background  variance,  illumination  changes,  shadow  or  varying  camera  viewpoints  [125,  101,  163, 
69].  Detection  and  tracking  in  airborne  urban  traffic  surveillance  is  impacted  by  many  more  difficul¬ 
ties  such  as  small  and  low  resolution  targets,  large  moving  object  displacement  due  to  low  frame  rate, 
congestion  and  occlusions,  motion  blur  and  parallax  effect,  camera  vibration  and  exposure  and  varying 
viewpoints  [185,  200,  70,  55,  128,  71,  169,  166,  160]. 

A  typical  moving  object  detection  system  follows  either  an  appearance-based  or  motion-base  ap¬ 
proaches.  Moving  vehicle  detection  algorithms  typically  focus  on  motion-based  detection  methods  [48, 
190,  72]  since  appearance-based  methods  [31,  192]  are  mostly  computationally  expensive  particularly 
when  applied  to  large  scale  aerial  imagery.  However,  many  recent  methods  deploy  fusion  schemes  to 
combine  the  strength  of  each  individual  detection  technique  and  improve  system  robustness  [125,  109,  72, 
213,210,  119,  77]. 

In  this  project,  we  developed  a  motion-based  detection  technique  exploiting  fusion  of  flux  trace  spatio- 
temporal  information  and  building  depth  map  to  enhance  the  robustness  and  filter  out  false  responses 
caused  by  tall  buildings  parallax  effects.  The  coarse  building  masks  were  refined  and  guided  to  the  actual 
building  boundaries  using  a  level-set  geodesic  active  contour  framework  to  save  the  detected  moving 
vehicle  next  to  the  buildings.  We  used  the  state  of  the  art  registration  algorithm  called  MU  BA4S  in  order 
to  orthorectified  image  sequences  in  a  global  reference  system  to  maintain  the  relative  movement  between 
the  moving  camera  platform  and  the  fixed  scene  [20,  22].  Figure  20  illustrates  our  proposed  semantic 
fusion-based  moving  vehicle  detection  system  pipeline. 
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Figure  20:  Semantic  fusion-based  moving  vehicle  detection  system  pipeline. 

3.8.2  Orthorectification 

Videos  in  aerial  imagery  are  captured  on  a  moving  airborne  platform.  Detection  of  moving  objects,  e.g. 
vehicles,  in  a  scene  which  is  observed  by  a  camera  that  by  itself  has  large  movement  and  big  jitters  can  be 
extremely  challenging.  To  address  this  problem,  images  from  camera  planes  are  orthorectified  (registered) 
in  a  global  reference  system  to  maintain  the  relative  movement  between  the  moving  platform  and  the  scene 
fixed.  In  this  work  the  registration  has  been  carried  out  by  applying  a  homography  transformation  between 
each  image  plane  and  the  ground  dominant  plane  of  the  scene.  Such  homography  transformations  are 
analytically  obtained  using  camera  parameters,  i.e.  their  rotation  matrices  and  translation  vectors.  First 
the  noisy  camera  parameters  (referred  to  as  metadata )  obtained  from  platform  sensors  (i.e.  IMU  and  GPS) 
are  refined  by  a  fast  Structure-from-Motion  algorithm(BA4S),  proposed  in  [20,  22].  After  the  refinement 
process,  the  homography  matrix  between  the  ground  plane  n  and  the  image  plane  of  camera  C  is  computed 
as  follows: 

cHn  =  K  [ri  r2  t]  (42) 

ri  and  T2  being  the  first  and  second  columns  of  the  rotation  matrix  from  the  world’s  to  the  camera  local 
coordinate  system,  t  is  the  corresponding  translation  vector,  and  K  is  the  camera  calibration  matrix.  As  a 
result,  a  2D  homogeneous  image  point  x  from  camera  C  can  project  back  on  n  as: 

=  CH“1  X  =  nnc  X  (43) 

where  c  is  the  inverse  of  the  homography  cH7r  in  42.  Such  a  homography  transformation  is  valid  to 
transform  all  points  between  the  image  and  ground  plane,  provided  that  their  corresponding  3D  point  lies 
on  the  ground  plane.  Otherwise  applying  this  homography  transformation  for  points  off  the  ground  plane 
would  cause  a  phenomenon  known  as  parallax  (see  [9  ]  for  more  details). 

In  this  part  of  the  project,  we  conducted  our  experiments  on  ABQ  aerial  imagery  which  were  collected 
by  TransparentSky  [5]  over  downtown  Albuquerque,  NM  and  ortho-rectified  using  MU  BA4S  state-of-the- 
art  registration  approach.  Figure  21  shows  samples  of  raw  and  geo-registered  ultra  high  resolution  images 
(6400  x  4400).  However,  we  worked  on  a  cropped  2k  x  2k  region  of  interest  for  which  the  ground-truth 
are  provided  by  our  collaborator  at  Kitware  (Fig.  21(BR)). 

3.8.3  Depth  Maps  in  Orthorectified  Projections 

The  output  of  our  MU  BA4S  bundle  adjustment  produces  refined  camera  parameters  and  a  sparse  3D  point 
cloud  [20,  22,  19].  Sparse  point  clouds  can  be  improved  to  produce  dense  point  clouds  using  different 
multi-view  stereo  algorithms  like  PMVS[8!  ],  VisualSFM/Bundler[218]  or  probabilistic  volume  method 
described  in  [57].  The  dense  3D  point  clouds  are  then  used  to  produce  depth  or  altitude  maps  by  projecting 
the  3D  points  into  each  camera  view.  Each  point  is  projected  on  a  grid  on  the  image  plane,  and  after 
taking  occlusion  into  account,  each  pixel  in  the  grid  is  assigned  at  most  one  point  in  the  point  cloud. 
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Figure  21:  (UL)  Illustrates  raw  ultra  high  resolution  images  30  MB,  each)  collected  from  an  airborne  platform  flying  over  downtown 
Albuquerque-NM.  (UR)  corresponding  raw  image  depth  mask  (LL)  shows  the  corresponding  registered  images  exploiting  MU  BA4S  orthrectifi- 
cation  approach,  (LR)  corresponding  orthorectified  depth  mask 

The  altitude  value  of  this  point  is  used  as  the  intensity  for  the  corresponding  pixel.  This  way  we  can 
produce  a  mask  for  each  view  such  as  the  one  shown  in  Figure  21(LL).  Then  same  homography  that 
was  applied  to  original  image  can  be  used  for  the  altitude  mask  to  obtain  the  estimated  altitude  of  every 
pixel  in  the  ortho-rectified  image,  as  shown  in  Figure  21(LR).  Frame  specific  depth  maps  provide  valuable 
information  for  distinguishing  between  motion  detections  due  to  motion  parallax  (motion  of  tall  structures 
due  to  viewpoint  changes)  and  moving  objects  on  the  ground.  The  semantic  fusion  method  for  combining 
depth  and  motion  is  discussed  in  more  detail  in  Section  3.8.6. 

3.8.4  Motion  Detection 

The  developed  framework  uses  flux  tensor  to  extract  motion  blobs  and  to  refine  building  mask.  Flux 
tensor  is  presented  as  an  extension  of  3D  structure  tensor  that  allows  reliable  motion  segmentation  without 
expensive  eigenvalue  decomposition  [45,  157].  Following  section  briefly  describes  flux  tensor  structure 
and  computations. 


3.8.5  Flux  Tensors 


Flux  tensor  provides  reliable  detection  of  moving  structures  by  computing  temporal  variations  of  the 
optical  flow  field  within  the  local  3D  spatiotemporal  volume.  Under  constant  illumination  model,  optic- 
flow  equation  of  a  spatiotemporal  image  volume  I(x)  centered  at  location  x  =  [x,  y ,  t\  [100]  is 


dl(x) 

dt 


<9I(x) 

dx 
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dl(x) 

dy 


VV  + 


dl(x) 
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VIT(x)  v(x) 
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taking  the  derivative  of  Eq.  44  with  respect  to  t,  we  obtain  Eq.  45 


d_  f  ril(x)  \ 
dt  \  dt  ) 
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which  can  be  written  in  vector  notation  as, 


^(VIT0r)v(x)) 


aviT(x) 

dt 


v(x)  +  VIT(x)  a(x) 


(45) 


(46) 


where  v(x)  =  [vx,  vy,  r/J  is  the  optic-flow  vector  and  a(x)  =  [ax ,  ay,  at]  is  the  acceleration  of  the  image 
brightness  located  at  x.  Usually  v(x)  is  estimated  by  minimizing  Eq.  46  over  a  local  3D  image  patch 

O(x.y): 

5V^(X)  v(x)  +  VIT(x)  a(x)  =  0  (47) 

Assuming  a  constant  velocity  model  subject  to  the  normalization  constraint  1 1  v(x)  1 1  =  1  and  consequently 
zero  acceleration,  a  least-squares  error  measure  q5(x)  (Eq.  48)  is  used  to  minimize  the  Eq.  47 


e/s(x) 


L,y)  v(x))  dy 
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(48) 


Differentiation  of  q5(x)  with  respect  to  v,  leads  to  eigenvalue  decomposition  problem  Jp(x)  v(x)  = 
A  v(x).  The  3D  flux  tensor  Jf  for  the  spatiotemporal  volume  centered  at  (x,  y )  can  be  written  in  expanded 
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The  elements  of  the  flux  tensor  incorporate  information  about  temporal  gradient  changes  which  leads  to 
efficient  discrimination  between  stationary  and  moving  image  features.  Thus  the  trace  of  the  flux  tensor 
matrix  which  can  be  compactly  written  and  computed  as,  trace(JF)  =  1 1  ^ VI | \2dy  can  be  directly 
used  to  classify  moving  and  non-moving  regions  without  the  need  for  expensive  eigenvalue  decomposi¬ 
tions. 


3.8.6  Fusion  of  Flux  Trace  and  Depth-map  Information  Using  Edge  Feature  Indicator  Functions 

As  described  in  Section  3.8.4,  each  pixel  is  classified  as  moving  versus  stationary  by  thresholding  trace  of 
the  corresponding  flux  tensor  matrix  (trace(JF)).  However,  approximately  70%  of  the  motion  detection 
mask  are  induced  by  parallax  motion  affects  of  tall  structures  (Fig.  54).  We  develop  a  context-based 
semantic  fusion  approach  to  reject  such  false  detections  due  to  tall  structures  by  using  the  depth  map 
information  in  a  boundary  refinement  and  filtering  processing  stage.  As  it  is  described  in  Section  3.8.3, 
the  approximate  altitude  of  every  pixel  in  the  ortho-rectified  image  can  be  computed  from  dense  3D 
reconstruction.  In  order  to  produce  the  building  mask,  image  depth  map  is  thresholded.  We  identify  all 
the  pixels  with  altitude  values  greater  than  a  defined  threshold  as  (greater  than  20  meters  for  this  sequence) 
as  tall  structures  which  will  be  removed  from  flux  tensor  motion  mask  (Fig.  55).  Figure  22  illustrates  the 
true  motion  detection  produced  by  flux  tensor  (in  yellow  color)  and  false  detection  caused  by  parallax 
in  white  color.  The  high  altitude  areas  which  are  filtered  by  building  mask  shown  in  blue.  Provided 
Ground-truth  bounding-boxes  are  shown  in  red  to  visually  evaluate  the  performance  of  the  detection. 

2D  depth  maps  are  projected  from  3D  point  clouds  obtained  by  3D  reconstruction  of  the  scene  (Sec¬ 
tion  3.8.3).  These  point  clouds  have  lower  resolution  compared  to  the  analyzed  images.  Low  resolution 
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Figure  22:  illustration  of  motion  detection  results:  true  motion  detection  produced  by  flux  tensor  (in  yellow  color)  and  false  detection  caused  by 
parallax  in  white  color.  The  high  altitude  areas  which  are  filtered  by  building  mask  shown  in  blue.  Provided  Ground-truth  bounding-boxes  are 
shown  in  red  to  visually  evaluate  the  performance  of  the  detection. 


combined  with  2D  projection  inaccuracies  may  result  in  filtering  out  correctly  detected  vehicles  positioned 
close  to  tall  structure  (Fig.  22). 

In  order  to  refine  the  coarse  building  map  B^map,  we  proposed  to  fuse  the  high  resolution  moving 
edge  information  from  flux  tensor  trace  with  building  depth  mask  through  a  level- set  based  geodesic 
active  contours  framework. 

3.8.7  Building  Mask  Refinement  Using  Level  Set  Based  Active  Contours 

The  flux  trace  is  used  to  construct  an  edge  indicator  function  gp  which  will  guide  and  stop  the  evolution 
of  the  geodesic  active  contour  when  it  arrives  at  tall  structure  boundaries, 

9Htrace(Jr))  =  1  +  trle(Jr)  <5°> 

The  edge  indicator  function  is  a  decreasing  function  of  the  image  gradient  that  rapidly  goes  to  zero  along 
building  edges  and  holds  high  values  elsewhere. 

Active  contours  evolve  a  curve  C,  subject  to  constraints  from  a  given  image.  In  level  set  based  active 
contour  methods  the  curve  C  is  represented  implicitly  via  a  Lipschitz  function  (f>  by  C  =  {(x,  y)  \c/)(x,  y)  = 
0},  and  the  evolution  of  the  curve  is  given  by  the  zero-level  curve  of  the  function  cf)(t ,  x,  y).  Evolving  C  in 
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a  normal  direction  with  speed  F  amounts  to  solving  the  differential  equation  [52], 


—  =  |  V0|F;  0(0,  x,  y )  =  4>0{x,  y )  (51) 

Unlike  parametric  approaches  such  as  classical  snake,  level  set  based  approaches  ensure  topological  flex¬ 
ibility  since  different  topologies  of  zero  level-sets  are  captured  implicitly  in  the  topology  of  the  energy 
function  </>.  Topological  flexibility  is  crucial  for  our  application  since  we  want  to  guide  the  coarse  thresh- 
olded  building  mask  to  the  actual  building  contours  and  reveal  the  filtered  moving  vehicles  next  to  build¬ 
ings.  We  used  geodesic  active  contours  [4(  ]  that  are  effectively  tuned  to  Flux  trace  edge  information.  The 
level  set  function  <fi  is  initialized  with  the  signed  distance  function  of  the  coarse  building  mask  {B^map) 
and  evolved  using  the  geodesic  active  contour  speed  function, 

^  =  5F(trace(JF))(c  +  /C(0))|V0|  +  V0  •  V5F(trace(JF))  (52) 


where  gp (trace (Jf))  is  the  fused  edge  stopping  function  (Eq.  50),  c  is  a  constant,  and  /C  is  the  curvature 
term, 
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The  force  (c  +  1C)  acts  as  the  internal  force  in  the  classical  energy  based  snake  model.  In  this  work,  the 
constant  velocity  c  pushes  the  curve  inwards  to  the  tall  structures.  The  regularization  term  K  ensures 
boundary  smoothness.  The  external  image  dependent  force  gp (trace (Jf))  is  used  to  stop  the  curve 
evolution  at  building  boundaries  edges.  The  term  VgF  *  V</>  introduced  in  [49]  is  used  to  increase  the 
basin  of  attraction  for  evolving  the  curve  to  the  boundaries  of  the  objects. 

Figure  23  shows  the  improved  building  contours  results  in  red.  The  blue  line  are  the  initial  building 
contours  which  are  evolved  and  stopped  at  building  actual  boundaries.  As  it  can  be  seen,  the  previously 
filtered  detected  cars  by  initial  building  mask  are  revealed  and  counted  as  true  detections. 


3.8.8  LoFT:  Likelihood  of  Features  Single  Object  Tracking  Framework 

Likelihood  of  Features  Tracking  (LoFT)  [167]  is  our  original  recognition-based  target  tracking  system 
initially  designed  for  vehicle  tracking  in  low  frame  rate  wide  area  motion  imagery,  later  also  used  in 
full  motion  video  analysis.  LoFT  uses  a  rich  feature  set  describing  intensity,  edge,  shape  and  texture 
information  (Figure  24). 

Target  to  search  window  feature  comparison  is  performed  using  cross-correlation  and  sliding  window 
histogram  differencing  using  an  efficient  integral  histogram  computation  scheme.  The  process  produces 
a  likelihood  map  for  each  individual  feature.  Different  features  are  more  sensitive  or  more  robust  against 
different  target  characteristics  and  environmental  conditions.  Fusing  different  feature  likelihood  maps 
enables  adaptation  of  the  tracker  to  scene  dynamics  and  target  appearance  variabilities.  LoFT  uses  two 
likelihood  map  weighting  schemes,  variance  ratio  (VR)  [56]  for  histogram-based  features  and  distractor 
index  (DI)  [156]  for  correlation-based  features.  LoFT  appearance  adaptation  scheme  maintains  and  up¬ 
dates  a  single  template  by  calculating  affine  changes  in  the  target  to  handle  orientation  and  scale  changes 
[167].  LoFT  target  template  update  is  performed  continuously  to  ensure  accurate  target  localization. 

3.9  Dynamic  Scene  Analysis 

On  this  project,  the  goal  was  to  perform  dynamic  3D  scene  analysis.  To  that  end,  beside  structural  scene 
analysis,  we  have  focused  on  moving  object  tracking  for  analysis  of  scene  dynamics.  Tracking  is  the 
process  of  estimating  the  locations  of  objects  of  interest  over  time.  Despite  all  the  recent  advancements 
in  computer  vision,  visual  tracking  remains  to  be  a  challenging  task  because  of  the  real-world  conditions 
such  as  partial  or  full  occlusions,  background  clutter,  shadow  and  other  illumination  artifacts. 

Many  visual  tracking  approaches  have  been  proposed  in  the  literature  to  address  the  unique  require¬ 
ments  of  different  applications.  These  tracking  approaches  can  be  categorized  in  various  ways  such 
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Figure  23:  Improved  building  mask  using  geodesic  active  contours:  blue  lines  are  the  initial  building  contours  which  are  evolved  and  stopped  at 
building  actual  boundaries  (red  lines).  As  it  can  be  seen,  the  previously  filtered  detected  cars  by  initial  building  mask  are  revealed  and  counted 
as  true  detections. 

as  detection-based  [92]  [18  ]  versus  recognition-based  [178][220];  single-object  [220][8(][  58]  versus 
multi-object  [1 1  ][30][  4];  or  according  to  how  the  objects  are  localized  within  the  search  region,  as 
generative[108][138]  versus  discriminative  methods  [126] [40]. 

3.9.1  CS-LoFT:  Color  and  Scale  Adaptive  LoFT  for  Single  Object  Tracking 

CS-LoFT  extends  the  original  LoFT  framework  which  have  shown  to  obtain  good  results  for  WAMI 
datasets  [170][  168]  [  5<  ]  with  (i)  color  attributes  for  improved  appearance  description;  (ii)  automated 
feature  scale  selection  for  scale  change  adaptation;  and  (iii)  an  operation  mode  decision  unit  to  switch 
between  CS-LoFT’s  three  different  modes  of  operation  (color,  intensity,  and  motion  kinematics)  based 
on  appearance  reliability.  Figure  25  illustrates  the  operation  mode  decision  with  some  examples.  Color 
information  is  incorporated  to  the  LoFT  framework  using  our  extension  of  Color  Name  (CN)  scheme 
introduced  in  [206].  Max-pooling  scale  selection  is  done  on  likelihood  maps  obtained  for  local  shape 
index  and  normalized  curvature  index  features.  The  operation  mode  decision  unit  activates  color  use  when 
there  is  adequate  color  dissimilarity  between  the  tracked  object  and  its  surrounding  region.  Incorporation 
of  color  and  scale  selection  improves  the  tracker  performance  particularly  for  sequences  where  the  tracked 
objects  have  distinct  color  signatures  and  undergo  rapid  scale  changes. 
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Figure  24:  LoFT  features. 


Figure  25:  Decision  mode  operation,  with  three  modes  (color,  intensity,  and  kinematics),  the  tracker  switch  across 
these  three  modes  according  to  the  sequence  environment. 


45 

APPROVED  FOR  PUBLIC  RELEASE;  DISTRIBUTION  UNLIMITED 


3.9.1. 1  Adaptive  Color  Use  with  Color  Names  (CN)  Feature 

Most  modern  trackers  ignore  color  information  and  rely  purely  on  features  derived  from  grayscale  infor¬ 
mation  [  1][22  ][  >6].  Grayscale  images  are  sometimes  sufficient  to  produce  reasonably  good  tracking 
results  for  lower  computational  cost.  However  rich  chromatic  information  can  be  very  valuable  for  ob¬ 
ject  tracking.  Various  methods  have  been  proposed  to  incorporate  color  information  to  trackers  such  as 
simple  color  space  transformations  in  [172,  153]  or  color  histograms  in  [225].  More  recently,  well  per¬ 
forming  trackers  have  been  further  improved  with  incorporation  of  color  information.  In  [50]  Centinet  et 
al.  combined  color  histograms  with  Minimum  Output  Sum  of  Squared  Error  (MOSSE)  correlation  filter 
[40]  as  a  robust  descriptor  for  object  tracking.  Danelljan  et  al.  [62]  explored  the  contribution  of  color 
in  tracking-by-detection  frameworks  and  extended  CSK  tracker  [9  ]  with  various  color  incorporation  ap¬ 
proaches.  Benlefki  et  al.  [32]  used  color  attributes  along  with  motion  masks  to  boost  tracker  performance. 
While  color  introduces  rich  information  for  object  description,  color  measurements  can  vary  significantly 
over  an  image  sequence  due  to  variations  in  illumination,  shadows,  and  specular  reflections.  Invariance  to 
these  factors  has  been  studied  in  image  classification[96][153],  action  recognition [9i  ],  and  tracking  [  52]. 

CS-LoFT  extends  the  original  LoFT  framework  with  color  information  using  our  extension  of  Color 
Names  (CN)  feature  [206]  (CN-Ex)  .  The  goal  is  to  boost  the  performance  of  the  tracker  particularly 
for  video  sequences  where  foreground  and  background  have  distinct  color  signatures.  Linguistic  study 
described  in  [3  ■  ]  shows  that  English  language  contains  eleven  basic  color  terms:  black,  blue,  brown,  grey, 
green,  orange,  pink,  purple,  red,  white  and  yellow.  Color  Names  (CN)  feature  associates  RGB  color 
model  with  linguistic  color  labels.  We  use  the  mapping  provided  by  [206]  to  map  RGB  color  models 
to  10  color  names.  The  process  reduces  the  2563  RGB  color  space  to  101  color  names  space.  While 
CN  includes  mapping  for  black,  gray,  and  white.  We  add  three  additional  bins  for  low  saturation  regions 
(low  saturation  &  low  intensity,  low  saturation  &  mediaum  intensity,  low  saturation  &  high  intensity),  to 
prevent  their  switch  from  one  color  name  to  another  with  slight  illumination  changes,  shadow  etc. 

Color  information  is  used  in  the  tracking  process  only  when  tracked  object  and  its  surroundings  have 
distinct  color  features.  C-LoFT  uses  an  adaptive  tracking  scheme  and  uses  color,  intensity,  or  motion 
kinematics  depending  on  availability  and  reliability  of  these  features.  We  use  Bhattacharyya  distance 
[37]  to  measure  color  similarity  between  target  and  the  surrounding  area.  Let  H Qbj  and  be  13  —  bin 
color  names  histograms  for  tracked  object  and  surrounding  background  respectively.  Discrete  probability 
densities  p(i)  for  the  object,  and  q(i )  for  the  background,  are  computed  by  normalizing  each  histogram  by 
corresponding  number  of  elements  in  it: 

p(i)  = - J- — ;  q(i)  =  —*■ —  (54) 

^ obj  Tlbg 

Bhattacharyya  distance  between  object  versus  background  color  distributions  is  computed  as: 

13 

s  =  Yi  VW)W)  (55) 

2=1 

Inclusion  (or  exclusion)  of  color  feature  is  determined  for  each  sequence  based  on  the  value  of  S.  Fig¬ 
ure  26  shows  the  process  of  calculating  S  for  two  difference  sequence.  Low  values  of  S  correspond  to 
discriminant  color  information  (dissimilar  object  versus  background  color  distribution).  Figure  27  illus¬ 
trates  the  incorporation  of  color  information  in  C-LoFT. 

3.9.1.2  Scale  Space  Tracking 

When  tracked  objects  move  along  the  camera  axis,  their  sizes  in  the  video  change.  Many  trackers  such  as 
CSK[96],  or  MOSSE[40]  track  the  objects  of  interest  at  a  single  scale  and  ignore  possible  changes  of  size. 
This  implies  poor  performance  in  sequences  with  significant  scale  variations.  Incorporation  of  multi-scale 
templates  as  in  DSST  (Discriminative  Scale-Space  Tracking)  [60]  improves  tracking  performance  through 
scale  changes. 
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Figure  26:  Two  examples  of  different  sequences  show  the  difference  values  of  S  depends  on  the  similarity  between 
target  and  surrounded  region. 


]  Likelihood  Fusion  [ 


Final  Fused  Map 


Figure  27:  Incorporation  of  color  information  in  C-LoFT.  In  which  the  likelihood  map  of  I  represents:  Intensity 
histogram,  GM :  Gradient  magnitude  histogram,  SI:  Shape  index  histogram,  NCI:  Normalize  curvature  index 
histogram,  HoG :  Histogram  of  gradient,  NCC(I):  Normalized  cross  correlation  for  intensity,  and  NCC(GM ): 
Normalized  cross  correlation  for  Gradient 


CS-LoFT  extends  LoFT  with  multi-scale  processing.  Scale  changes  not  only  alter  size  of  the  tracked 
objects  but  also  affect  scale  of  their  texture.  LoFT  features  that  are  most  sensitive  to  scale  changes  are 
local  shape  features  local  shape  index  (Eq.58)  and  normalized  curvature  index  (Eq.59)  [156]  derived  from 
eigenvalues  of  Hessian  matrix. 

Hessian  matrix  H  describes  the  second  order  structure  of  local  intensity  variations  around  each  image 
point  L(x\y ), 


H{x,y;a ) 


Lxx(x,y;a )  Lxy(x,y,a) 
Lxy(x,  y,  a)  Lyy(x,y;a ) 


(56) 


T 

where  Lxy  =  L(x ,  y\  a)  refers  to  the  Gaussian  smoothed  intensity  image 


L(x,  y:  a)  =  g(x,  y:  a)  *  L(x,  y) 


(57) 
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g  is  the  Gaussian  kernel,  and  a  is  the  scale.  Local  shape  index  (Eq.58)  and  normalized  curvature  index 
(Eq.59)  are  derived  from  the  eigenvalues  Ai  >  A2  of  Hessian  matrix  H. 


0(x,  y )  =  tan 


1  =  tan-1 

(58) 

((Ai  (x,y)2)  +  (A  2(x,y)2))^ 

1  +  L(x,y) 

(59) 

Frame  Level  Max-Pooling  Scale  Selection: 

We  developed  a  novel  frame  level  scale  selection  as  described  in  Algorithm  8.  First  local  shape 
(Eq.  58)  and  normalized  curvature  (Eq.  59)  indices  are  computed  at  multiple  scales.  Sliding  window 
template  matching  is  performed  between  search  window  and  tracked  object  template;  and  likelihood  maps 
C(Ti  are  obtained  for  each  scale  ah.  Then,  pixelwise  maximum  likelihood  is  computed  as: 


Cmax{x,y)  =  max(Cai(x,y))  (60) 

Finally,  best  scale  cr*  is  then  selected  as  the  scale  that  minimizes  the  mean  square  error  between  the 
pixelwise  maximum  likelihood  Cmax  and  each  C(Ji . 

a*  =  argmin(^2(Cmax  -  Cai))  (61) 

1  x,y 

The  described  frame  level  max-pooling  scale  selection  approach  is  applied  to  shape  index  and  normalized 
curvature  index  separately.  Figure  28  describes  frame  level  max-pooling  scale  selection  approach.  Like¬ 
lihood  maps  corresponding  to  selected  scales  are  fused  with  likelihood  maps  for  the  remaining  features. 
Proposed  frame  level  scale  selection  ensures  (i)  scale  coherence  in  the  selected  likelihood  map  (all  pixels 
correspond  to  the  same  scale);  and  (ii)  robustness  against  local  outliers. 


Pixel-wise  Maximization 


Likelihood 
i  map  with 

minimum 
n  error  ✓ 


Figure  28:  Frame  level  max-pooling  scale  selection  approach. 


For  this  project,  we  have  developed  both  recognition-based  single-object  and  association-based  multi¬ 
object  tracking  systems.  The  goal  of  our  single-object  tracking  systems  is  to  track  few  objects  of  interest 
in  the  scene.  The  goal  of  our  multi-object  tracking  system  is  to  track  all  the  moving  objects  in  a  particular 
region  of  interest  for  analysis  of  motion  behavior  patterns.  MU  single-object  and  multi-object  tracking 
systems  are  briefly  described  below. 

3.9.2  KC-LoFT:  Kernelized  Correlation  Extended  LoFT  for  Single  Object  Tracking 

Correlation  filter-based  tracking  is  a  discriminative  method  in  which  a  filter  based  on  the  object  appearance 
is  used  to  estimate  the  most  likely  target  location  in  the  search  window  by  distinguishing  the  target  from  its 
surrounding  background.  Convolution  of  the  search  window  image  with  the  filter  is  performed  in  Fourier 
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Algorithm  8  Frame  level  max-pooling  scale  selection 

Input  :  Isw,It •  search  window  and  object  template 
Output  :  cr*,  £*:  best  scale  and  corresponding  likelihood  map 
1:  for  each  scale  i  do 

2:  L(x,  <Ji)  4r-  g{x,y\<Ji)  *  I(x,y) 

3:  Compute  Hessian  matrix  H (x,  y\  a)  Eq.  56  and  its  eigenvalues  Ai,A2 

4:  Compute  local  shape  features  <f)(x,  y )  Eq.  58,  0(x,  y)  Eq.  59 

5:  C(Ti  sliding  window  template  matching  between  search  window  &  object  template 

6:  end  for 

7:  Pixelwise  maximum  likelihood: 

y )  max(Cai(x,y)) 

8:  Best  frame  level  scale: 

cr*  <-  argmin(J2(£max  ~  £ ^ )) 

i  x,y 


domain  as  element-wise  multiplications  to  reduce  complexity.  The  used  filters  are  updated  online  to  adapt 
to  object  appearance  changes.  Since  their  first  use  in  visual  tracking  [40]  [9  ],  correlation  filter-based 
tracking  has  been  extended  in  multiple  ways.  [126]  [60]  improved  adaptation  to  scale  changes;  while  [34] 
incorporated  histogram  based  ridge  regression  learning  to  improve  robustness  to  fast  deformation  and 
shape  variations. 

For  a  tracking  system  to  handle  a  variety  of  challenging  conditions,  it  is  important  to  use  multiple 
information  cues  robust  to  these  conditions.  Feature  fusion  needs  to  be  explored  in  order  to  overcome 
weakness  of  individual  information  cues  and  to  strength  the  overall  process.  Some  feature  fusion  methods 
used  in  visual  tracking  are  fixed  linear  combination  [123  ]  [12  ],  adative  weight  update  [5  ][  50],  variance 
ratio  [56],  Bayesian  probability  distrubuation[195],  Bhattacharyya  distance  [12],  peak-to-sidelobe  ratio 
[40]. 

Our  group  developed  a  Likelihood  of  Features  Tracking  (LoFT)  system[167]  (described  above)  that 
fuses  multiple  sources  of  information  about  target  and  its  environment  to  perform  robust  single  object 
tracking.  KC-LoFT  extends  and  improves  the  original  LoFT  framework  by  incorporating:  (i)  a  discrimi¬ 
native  module  using  kernelized  correlation  filters  to  strength  LoFT’s  adaptation  to  environment  changes; 
(ii)  a  multi-dimensional  kernelized  template  (beside  LoFT’s  original  multi-feature  template)  to  ensure 
comprehensive  localization  of  the  target  in  different  scenarios;  and  (iii)  a  decision  module  to  adaptively 
update  and  fuse  kernelized  template.  The  new  decision  module  uses  peak-to-sidelobe  ratio  (PTSR)  crite¬ 
rion  to  avoid  fusing  any  irrelevant  response  from  the  kernelized  correlation  filters  to  the  other  LoFT  feature 
maps  and  to  prevent  update  of  the  regression  parameters  and  correlation-based  template  during  occlusion 
or  other  cases  of  sudden  appearance  change.  We  have  tested  the  proposed  KC-LoFT  tracker  on  wide  area 
motion  imagery  and  compared  its  performance  to  non-deep  learning  trackers  from  VOT2015[11  ]  and 
VOT2016[1 15]  challenges  (Table  4)  whose  codes  were  publicly  available. 

KC-LoFT  kernelized  correlation  module  uses  two  features,  HoG  and  intensity,  to  localize  the  target 
within  the  search  window.  The  two  features  are  stacked  to  form  a  single  vector  x  that  is  then  used  in  the 
kernelized  correlation  filter  (KCF)  scheme.  The  goal  of  correlation  filter  is  to  minimize  the  square  error 
over  the  sample  x  and  the  expected  target  y  using  the  ridge  regression  loss  function  defined  as  follow 

n 

min  y2(f(xi)  -  Vi)2  +  A  Ww\\  (62) 

i 

where  n  is  the  length  of  the  feature  vector  and  A  controls  the  regularization  parameter  w  over  the 
linear  combination  of  samples  f(x)  =  wTx.  Following  the  description  and  derivatives  on  [9"  ]  regression 
learning  parameter,  a  can  be  learned  using 


a  =  - 


y 


kxx  +  A 


(63) 
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where  kxx  is  a  correlation  kernel.  Gaussian  kernel  is  used  as  follow 

kxx  =  exp( - ^  (llxl|2  +  ||x/||2)  —  (2F~1(x  ©  x  *))  (64) 

Where  x  is  the  DFT  of  x,  x*  is  the  complex-conjugate  of  x \  ©  is  an  element  wise  multiplication,  and 
"is  the  discrete  Fourier  transform.  To  detect  the  position  of  the  object,  a  new  patch  z  (search  window)  is 
cropped  form  the  location  estimated  from  Kalman  filter.  The  response  is  found  according  to  the  correlation 
between  previously  learned  template  x  and  new  patch  z. 

f{z)  =  (kxzy  ©  a  (65) 

The  steps  of  the  target  detection  and  online  training  process  are  described  in  Algorithm  9,  where  6  is 
the  learning  rate  assumed  to  be  0.1. 

Algorithm  9  Target  detection  and  training  process  using  correlation  filter  module 

Input  :  Pf-i,z,df-i,x:  predicted  position,  current  patch  (search  region),  dual  space  coefficient,  the 
previous  updated  correlation-based  template 

Output  :  Pf ,  (if ,  xnew :  estimated  new  position,  the  updated  dual  space  coefficient,  updated  correlated- 
based  template 

1:  Calculate  the  kernel  kxz  =  exp(— 4^  (||x||2  +  ||z||2)  —  2 F~x{x  ©  £*)) 

2:  Calculate  the  response  f(z )  =  ( kxzy  ©  i 

3:  Find  the  new  position  Pf  from  the  maximum  response  f(z ) 

4:  Crop  new  patch  xnew  with  Pf  as  a  center  of  the  region 
5:  Update  the  template  xnew  =  9x  +  (1  —  0)xnew 
6:  Calculate  anew  =  — — - - 

new  kXXnew  -j-A 

7:  Update  dual  space  coefficient  otf  —  Oaf- 1  +  (1  —  0)anew 


3.9.2.1  KC-LoFT:  Integration  of  Modules 


(LoFT Features):  1.  Histogram-Based:  Intensity.  Gradient 
Magnitude,  Shape  Index,  Normalized  curvature  Index. 

2.  Correlation-Base:  Intensity.  Gradient  Magnitude. 

Kernelized  Correlation  Filter(  KCF)  features:  Intensity, 
Histogram  of  Gradient  HoG. 


Template 

Update 


Final  Probability 
Map 


Figure  29:  KC-LoFT  template  update  and  likelihood  fusion  pipelines. 

KC-LoFT  integrates  the  discriminative  KCF  module  to  the  LoFT  framework  to  enable  robust  and 
efficient  localization  of  targets.  KC-LoFT  also  incorporates  an  online  template  update  scheme  to  LoFT 
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to  prevent  drifts  and  to  enable  robust  handling  of  partial  or  full  occlusions.  Figure  24  illustrates  modules 
involved  in  the  proposed  KC-LoFT  pipeline. 

Correlation  filter  learns  filter  parameters  and  updates  template  for  each  frame  by  including  positive  and 
negative  examples  within  the  search  window  using  ridge  regression.  KC-LoFT  fuses  the  responses  from 
LoFT  features  with  the  correlation  response  from  the  correlation  module  to  generate  a  final  fused  proba¬ 
bility  map  that  is  used  to  localize  the  target.  Because  the  correlation  template  update  happens  during  the 
processing  of  every  frame,  online  update  on  correlation-based  template  has  faster  response  to  appearance 
changes  than  appearance-based  template.  The  response  f(z )  from  the  correlation  filter  module  is  fused 
with  the  other  appearance-based  template  likelihood  maps  from  LoFT  to  generate  a  final  likelihood  map 
to  localize  the  expected  position  of  the  target.  Peak-to-sidelobe  ratio  (PTSR)  likelihood  map  evaluation 
criterion  is  used  to  avoid  updating  the  correlation-based  template  and  regression  parameters  during  occlu¬ 
sion  and  to  prevent  correlation  likelihood  map  to  be  fused  with  the  remaining  maps  when  it  is  unreliable. 
Peak-to-sidelobe  ratio  [40]  evaluates  the  discrimination  power  of  a  likelihood  peak  as  follow: 


PTSR 


Pmax  ^sidelobe 
& sidelobe 


(66) 


where  Pmax  is  the  value  of  the  maximum  response,  iisideiobe  and  cr sidelobe  are  the  mean  and  standard 
deviation  of  the  likelihood  map  values  except  an  11  x  11  area  around  the  maximum  peak.  Figure  30  shows 
sample  likelihood  maps  and  associated  PTSR  measurements. 

During  occlusion  events,  PTSR  helps  suspend  correlation  template  update  until  the  object  appears 
again.  Figure  29  illustrates  KC-LoFT  template  update  and  likelihood  fusion  processes.  Suspension  of  the 
template  update  process  is  important  to  preserve  the  target  template  through  occlusion  events. 


Figure  30:  Sample  likelihood  maps  and  associated  peak-to-sidelobe  ratio  values.  Lower  PTSR  values  indicate  occlusions  or  other  sudden 
appearance  change  scenarios  and  suspend  the  template  update  process. 


3.9.3  MU  Multi-Object  Tracking 

Multi-object  tracking  (MOT)  is  the  process  of  locating  objects  of  interest  in  a  video  sequence,  main¬ 
taining  their  identity  over  time,  and  producing  set  of  trajectories  corresponding  to  their  motion.  Our 
MOT  system  is  a  tracking-by-detection  framework  consisting  of  three  stages:  (1)  detection  validation 
and  filtering,  (2)  local  data  association,  and  (3)  global  data  association  (Figure  31).  For  each  frame  It 
of  a  given  video  sequence  V  =  {/i,/2,  of  length  Q ,  there  are  set  of  N(t)  detected  objects 

Dt  =  {dt5i,  dt:2,  •  •,  dt,N(t)}  where  dt,i  represents  object  i  in  frame  It.  Each  detected  object  dt,i  is  encoded 
with  the  vector  ( dt ,i  [x] ,  dt ,i  [y] ,  dt, i  [w] ,  dt,i  [h] ,  st,i)»  where  the  entries  represent  position,  width,  height,  and 
detection  score  respectively.  Tracking  determines  a  set  of  associations  A  =  {02,  03, 0q},  where  each 
Ot  represents  an  assignment  matrix  from  tracks  to  detections  at  frame  R. 

Detection  filtering  step  is  applied  to  eliminate  potential  false  detections  produced  by  the  detector. 
Objects  with  low  detection  scores  are  removed  to  reduce  false  positive  score.  Local  data  association  is  an 
online  step  that  links  the  detected  objects  in  the  current  frame  to  the  existing  tracks.  Hungarian  Algorithm 
[148]  is  used  to  resolve  associations  on  a  cost  matrix  relying  on  spatial  distances  between  tracks  and 
detections.  Global  data  association  is  an  offline  tracklet  to  tracklet  (rather  than  object  to  tracklet)  linking 
step.  Tracklets  (short  tracks)  terminated  early  because  of  problems  such  as  occlusions  or  weak  detection 
scores,  are  linked  to  tracklets  formed  after  object  recovery  using  object  appearance  cues  and  tracklet 
motion  history.  Appearance  is  described  and  compared  using  HoG  descriptor  [59]  and  CN  (color  name) 
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Association 


Figure  31:  The  framework  of  Multi-Object  Tracking. 

histogram  along  with  a  novel  weighted  CN  histogram  dissimilarity  measure.  Below  sections  summarize 
main  steps  involved  in  the  proposed  tracking  system. 

3.9.3. 1  Track  Initialization 

When  a  new  track  is  formed,  three  groups  of  information  corresponding  to  the  new  track  are  initialized: 
(1)  detected  object’s  location,  width,  height,  and  appearance;  (2)  counters  for  age,  visible  frames,  and 
invisible  frames;  and  (3)  Kalman  filter  parameters  KF\  =  { x\ ,  Pf}  where  x\  represents  state  estimate  for 
the  object  i  at  frame  t  and  Pi  represents  associated  covariance  matrix. 

3.9.3.2  Track  Position  Prediction 

Association  is  done  between  detected  object  positions  Dt  and  predicted  track  positions  Tt  at  frame  t. 
Constant  velocity  Kalman  filter  is  used  to  predict  track  positions.  The  process  involves  two  steps:  (1) 
prediction  step  that  uses  the  current  state  estimate  from  t  —  1  to  predict  the  estimate  for 

time  £;  and  (2)  update  step  where  the  current  prediction  { x\ ,  P\}  is  combined  with  current  observation 
(detection)  information  { d\ ,  R\}  to  refine  the  state  estimate  for  future  predictions,  where  d\  is  the  position 
and  R\  is  the  observation  covariance  matrix  of  the  detected  object  i . 

3.9.3.3  Local  Data  Association  using  Spatial  Distance 

For  the  tracking-by-detection  systems,  the  biggest  challenge  is  the  association  of  the  noisy  object  detec¬ 
tions  P*  =  {di,  g?2,  •  •••,  djy}  inavideo  frame  with  the  previously  tracked  objects  Tl~l  =  {Ti,  T2, ....,  Tm}- 
Detected  objects  are  assigned  to  existing  tracks  by  minimizing  a  cost  matrix  using  Munkres  Hungarian 
algorithm  [148].  Elements  of  the  cost  matrix  are  computed  as: 

C(i,j)  =  log  \\di(x,y),Tj(x,y)\\2  (67) 

where  di(x,  y )  and  Tj(x,  y )  are  the  centroids  of  detected  objects  and  predicted  tracks  respectively.  We 
use  circular  gating  around  predicted  track  positions  to  eliminate  highly  unlikely  associations,  to  reduce 
computational  cost,  and  to  reduce  false  matches.  Minimization  on  the  cost  matrix  results  in  the  assignment 
T  x  2  matrix  At  containing  the  indices  of  the  corresponding  detection  and  track  pairs.  At  determines  track 
states  for  frame  t.  The  four  possible  states  {new  track,  extended  track,  lost  track,  inactive  track}  are 
illustrated  in  Figure  32.  State  descriptions  and  associated  parameter  updates  are  summarized  in  Table  1. 

3.9.3.4  Global  Data  Association  using  Spatial  Distance  and  Appearance  Similarity 

Various  problems  during  object  detection  and  data  association  stages  may  cause  tracks  to  terminate  early 
resulting  in  short  tracklets  rather  than  full  tracks.  Global  data  association  is  used  to  link  these  tracklets  to 
generate  longer  tracks.  Global  data  association  is  an  expensive  process  because  it  involves  optimization  of 
all  possible  matches  not  just  the  ones  between  consecutive  frames.  Reducing  the  number  of  objects  that 
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Extended 


Figure  32:  Decision  State  after  assignment. 

Table  1:  State  description  and  associated  parameter  updates. 


State 

Parameter  Updates 

Description 

New  Track 

-  Track  ID 

-  Kalman  filter  state  KF\  =  { x\ ,  P\} 

-  Counters:  age,  visible  frames,  invisible  frames 

-  Appearance:  CN  color  histogram  &  HoG  descriptor 

Detected  objects  not  assigned  to  any 
existing  tracks  start  new  tracks. 

Extended 

Track 

-  Kalman  filter  state  KF\  =  { x\ ,  P\} 

-  Counters:  age,  visible  frames 

Detected  objects  successfully 

matched  to  existing  tracks  extend 
those  tracks. 

Lost  Track 

-  Counters:  age,  invisible  frames 

Tracks  not  assigned  to  any  detected 
objects  are  assigned  to  lost  state. 

Inactive  Track 

Save 

-  Track  ID 

-  Full  trajectory  (previous  and  last  seen  positions) 

-  Appearance:  CN  color  histogram  &  HoG  descriptor 

-  Counters:  age,  born/death  frames 

After  spending  A  time  steps  in  lost 
state,  unmatched  tracks  are  termi¬ 
nated. 

need  to  be  associated  is  helpful  in  reducing  this  computational  cost.  We  use  a  refinement  process  relying 
on  spatial  distance,  start  and  end  frames,  motion  direction,  and  appearance  model  of  the  tracklets  to  filter 
out  infeasible  tracklet  matches  before  the  global  assignment  step.  Given  a  tracklet  Kh  and  the  initial  set 
of  tracklets  J  =  {iCi,  K2,  ...,  iC&},  the  refinement  process  filters  out  the  tracklets  with  the  following 
properties  from  the  set  J  as  infeasible  forward  matches  for  FQ.  Where  forward  linking  refers  to  cases 
where  tracklets  are  appended  to  the  end  of  FQ. 

1 .  All  tracklets  Kj  that  are  initialized  on  frame  borders  or  other  source  positions  (tracklets  entering 
the  scene). 

2.  All  tracklets  Kj  that  are  born  before  the  death  of  tracklet  K 

3.  All  tracklets  Kj  that  start  beyond  the  spatial  distance  Ts  from  the  last  position  of  tracklet  Kh. 

4.  All  tracklets  Kj  that  start  beyond  the  temporal  distance  Tt  from  the  last  frame  of  tracklet  Ki. 

5.  All  tracklets  Kj  whose  appearance  distances  from  tracklet  Kh  are  above  Ta.  Appearance  is  de¬ 
scribed  in  terms  of  color  using  CN  (color  name)  distribution  and  in  terms  of  shape  and  texture  using 
HoG  descriptor. 

Refinement  process  for  backward  linking  where  Ki  gets  inserted  to  the  end  of  another  tracklet  is  done  in 
a  similar  way.  Once  all  potential  match  sets  are  refined.  Global  data  association  determines  the  tracklet 
to  tracklet  associations  by  minimizing  spatial  and  appearance  distances.  Details  on  our  appearance  model 
and  appearance  comparison  scheme  are  given  in  Section  3. 9. 3. 5. 

3.9.3.5  Appearance  Model  for  Tracklet  Linking 

It  is  important  to  have  discriminant  features  to  filter  out  infeasible  detection-to-tracklet  or  tracklet-to- 
tracklet  associations.  Tracked  objects’  appearance  models  provide  powerful  information  to  refine  the  set  of 
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candidate  matches.  However,  appearance  is  also  sensitive  to  external  factors  such  as  illumination  changes, 
shadows,  partial  occlusions,  pose,  viewing  angles  etc.  It  is  important  to  develop  appearance  models  that 
can  discriminate  different  tracked  objects,  while  being  invariant  against  factors  such  as  illumination,  pose 
etc.  In  this  work,  we  propose  an  appearance  model  that  combines  shape  and  texture  information  using  HoG 
descriptor  with  object  color  attributes  using  our  novel  color  correlation  cost  matrix.  HoG  [59]  is  a  widely 
used  powerful  descriptor  that  describes  shape  and  texture  through  histogram  of  gradient  orientations  in 
local  image  regions.  HoG  is  particularly  suitable  for  description  of  rigid  objects  such  as  cars  in  the  UA- 
DETRAC  dataset  [211].  We  record  the  HoG  descriptor  for  all  new  tracks  at  the  time  of  track  initialization. 
Mean  square  error  (MSE)  is  used  to  compute  the  distance  between  HoG  descriptors. 

We  incorporate  color  information  through  an  extension  of  the  CN  (Color  Names)  model  proposed 
in  [20(  ].  Linguistic  study  described  in  [  61]  shows  that  English  language  has  eleven  basic  color  terms: 
black,  blue,  brown,  gray,  green,  orange,  pink,  purple,  red,  white  and  yellow.  [206]  explores  this  concept 
to  generate  CN  (Color  Names)  map.  CN  model  associates  RGB  color  values  with  linguistic  color  labels 
and  reduces  the  2563  RGB  color  space  to  just  ll1  CN  color  space.  Biggest  challenge  in  color  description 
is  sensitivity  to  illumination.  Illumination  variations  and  shadows  may  alter  color  (and  intensity)  of  an 
object  making  the  information  not  reliable  (e.g.  shadow  may  change  blue  to  black,  or  yellow  to  brown). 
When  performing  color  appearance  comparison,  it  is  important  to  consider  similarity  of  individual  color 
codes  and  likelihood  of  colors  switching  from  one  value  to  another  (e.g.  while  blue,  yellow,  and  orange 
are  three  distinct  color  names,  distance  between  blue  and  yellow  is  higher  than  distance  between  yellow 
and  orange,  and  transition  likelihood  from  blue  to  yellow  is  lower  compared  to  transition  probability  from 
yellow  to  orange).  In  [171],  O’Pele  et.al.  proposed  a  color  distance  weight  matrix  fusing  CIEDE2000 
color  dissimilarity  [187]  with  CN  color  pair  probabilistic  distance  matrix.  In  this  work,  we  have  built 
and  used  an  11  x  11  CN-to-CN  color  correlation  weight  matrix  V\?cn  to  account  for  similarity  between 
different  CN  values  during  color  distribution  comparison.  The  elements  of  the  color  correlation  matrix  are 
computed  as  follows: 


WcN(ci,Cj)  =  1  -  2  x  max ( min  {G{k,  i),  Q{k,  j))) 

k 


(68) 


where  c%  and  cj  are  two  CN  codes  and  Q  is  the  2 15  x  11  matrix  describing  the  mapping  from  32  x  32  x 
32  quantized  RGB  space  to  11 -valued  CN  space  provided  by  [206].  In  order  to  compare  object  color 
distributions,  we  use  Earth  mover  distance  (EMD)  [182].  EMD  computes  the  minimum  paid  cost  to 
transfer  one  histogram  to  another  histogram.  We  use  color  correlation  matrix  Wcn  as  transfer  cost.  See 
Figure  33  for  more  details.  Use  of  cross-bin  color  histogram  distance  computation  using  EMD  scheme  and 
color  correlation  matrix  W  described  above  decreases  sensitivity  to  color  changes  caused  by  illumination 
variations  and  improves  tracking  results  compared  to  bin-to-bin  color  histogram  comparison. 

3.10  Feature  Work 

3.10.1  3D  Surface  Recovery,  Regularization,  Segmentation  and  Clustering 

3.10.1.1  3-D  Bilateral  Filter  for  Point  Cloud  Smoothing 

Various  edge  preserving  denoising  and  smoothing  techniques  exist  for  removing  noise  and  restoring 
faint  structures  in  surface  data.  Variational  and  PDE  based  methods  [28],  bilateral  filter  [202],  nonlo¬ 
cal  means  [42]  are  some  of  the  powerful  methods  in  edge  preserving  denoising.  Extensions  to  3D  data 
which  can  be  applied  to  surfaces,  mesh  and  volumes  have  been  considered  [110,  75,  139].  The  bilateral 
filter,  in  particular,  is  appealing  for  denoising  3D  data  due  to  its  simple  nature  and  its  extensible  proper¬ 
ties  [35].  The  bilateral  filter  is  an  edge  preserving  denoising  filter  that  has  been  widely  used  for  image  and 
image  plus  depth  map  smoothing.  We  briefly  recall  the  bilateral  filter  (BLF)  which  is  well-known  in  2D 
image  smoothing  and  sketch  the  mesh  denoising  version  of  it.  The  original  BLF  is  of  the  form 


(69) 


yeN(x) 
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Tracklet  1 


HoG  1 


CN  Histo  1 


Link  Trackletl  with  Tracklet2 


Figure  33:  Apperance  model  refinment  for  global  assignment.  Trackleti  need  to  be  assigned  to  one  of  the  candidates  (Tracklet2,3,4)  that  born 
after  Trackleti .  In  the  first  step,  Tracklet3  has  been  excluded  by  HoG  constrain.  Then,  Tracklet4  has  been  excluded  by  color  constrain.  Finally, 
After  refinement  Trackleti  is  linked  with  Tracklet2 


where  N(x)  defines  the  neighborhood  of  pixel  position  x ,  K(x)  the  normalization  factor,  and  I {x)  is  the 
signal  to  be  processed  at  x,  Wc,  Ws  are  usually  chosen  as  Gaussian  functions.  To  extend  the  BLF  to  mesh 
denoising,  we  need  to  apply  the  filter  to  the  signal  locally  defined  at  each  vertex.  To  obtain  locally  defined 
signal  points  we  can  either  project  a  vertex  to  the  first  order  surface  predictors  at  its  neighbors  [110]  or 
project  its  neighboring  vertices  to  the  predictor  at  the  vertex  [75]. 

3.10.1.2  Smoothed  Signed  Distance  Surface  Reconstruction 

We  consider  several  approaches  to  obtain  a  surface  from  a  finite  set  of  oriented  points.  One  can  utilize 
the  three-dimension  smoothed  signed  distance  (SSD)  function  [47].  This  method  uses  a  variational  for¬ 
mulation  to  reconstruct  a  watertight  surface  using  implicit  function  and  represents  the  surface  as  a  signed 
distance  field.  SSD  uses  a  Poisson  reconstruction  process  and  is  defined  in  the  continuous  minimization 
form  as: 


N 


N 


mm  A  Y  /(^)2  +  Y  ii  v/(x*)  - 

2=1  2=1 


/  N 


Ui 


\V\ 


[  WHf^w 

Jv 


dx, 


(70) 


where  {{xi,  rii)}f=l  is  the  oriented  point  cloud  (xi  surface  location  sample,  rii  surface  normal),  f  :  V  C 
M3  -A  R  is  a  signed  distance  field  on  a  bounded  volumetric  domain  V,  H  is  the  Hessian,  Ai,  A2  constants, 
and  N  is  the  number  of  points  in  the  data  set.  The  resulting  signed  distance  function  f{x)  is  then  utilized 
to  construct  a  mesh  using  dual  marching  cubes  algorithm.  The  limitations  of  the  schemes  are  extensive 
parameter  tuning  and  depending  upon  the  initial  noise  level  in  the  point  clouds  the  final  SSD  surfaces 
can  have  distorted  structures,  see  Figure  34.  Moreover,  there  is  no  explicit  planarity  model  constraint  and 
hence  building  tops  can  be  distorted. 


3.10.1.3  Variational  Optical  Flow  Refinement  of  an  Initial  Point  Cloud 

Alternatively,  assuming  a  rough  surface  S  is  obtained  by  either  Delaunay  triangulation  or  scattered  inter¬ 
polation  of  a  sparse  3D  feature  point  cloud,  we  can  use  dense  correspondences  computed  using  a  fast  and 
precise  variational  optical  flow  technique  that  extends  the  convex  optimization  approach  [87]. 

Let  K  be  the  camera  calibration  matrix.  Let  Pj  =  K[Rj  \  —  Rjtj]  be  the  projection  matrix  correspond¬ 
ing  to  image  Ij  (j  =  1,2,...,  TV  where  N  is  the  number  of  views  used).  Let  =  {xi,  yi^Zi)  be  the  point 
on  the  surface  S.  Let  [uj^vj]  =  u]  =  P J(X^)  be  the  projection  of  the  (visible)  point  X^  on  camera  j.  A 
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(a)  Example  SSD  reconstruction  with  roof-tops  distorted 


(b)  Partial  failure 


(c)  Complete  failure 


Figure  34:  Example  surface  reconstruction  with  the  SSD  method  on  Los  Angeles  point  cloud  data  obtained  using  our  voting  based  approach.  Top 
row:  Arrows  indicate  failure  regions  in  SSD  reconstruction.  Bottom  row:  Failed  reconstruction  results  due  to  sensitivity  to  noise  in  the  initial 
point  cloud. 


small  motion  of  the  point  gives  a  new  point 

X,  =  X,  +  AX, 

The  corresponding  image  displacement  is  given  by, 

U-  =  +  Auj 

Note  that  the  image  displacement  and  surface  displacement  at  a  point  X^  are  related  via  the  Jacobian: 

Au {  =  JJX  AX, 

where 


V  — 

Jx*  “  dX 


x. 


9Pj 

&Pj 

dPl 

dx 

dy 

dz 

&Pj 

dPj 

dPi 

dx 

dy 

dz 

x. 


The  following  main  steps  are  involved  in  estimating  the  X^  at  a  fixed  camera  image  k: 
•  Fix  a  camera  k,  then  all  points  must  satisfy: 

AX*  =  A<r* 

where  r,  is  the  unit  vector  direction  at  camera  k.  Then  for  all  the  cameras  j  ‘nearby’, 

Au|  =  J^..  AX*  =  J  xAir* 
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Compute  Au-  =  uj  —  uj : 

-  First  project  onto  the  base  surface  S  to  get  X^,  project  this  onto  nearby  cameras  j  to  get 


u. 


-  u;-  is  given  by  the  optical  flow  computed  at  uf . 
•  Solve  the  minimization: 


mm 

A* 


J^TiXi  -  Au i 


using  the  closed  form  solution 


^ i  ~  Jx  r^Au^ 


Finally, 


Xf  =  X?  + 


3.10.1.4  Simultaneous  Smoothing  and  Depth  Map  Fusion  Using  Variational  Methods 

We  propose  to  utilize  regularization  methods  for  smoothing  the  noisy  point  cloud  surfaces  without  in¬ 
troducing  distortions  or  removing  salient  3D  structures.  In  particular  we  focus  on  total  variation  (TV) 
based  methods  which  is  a  class  of  well-known  edge  preserving  regularization  approaches  [8  ].  If  we  let 
u  :  Q  C  M3  — »  [— 11]  be  the  truncated  signed  distance  function,  then  the  following  minimization  can  be 
used  to  obtain  simultaneous  smoothing  and  fusion  of  depth-map. 


r  N  f 

min  /  |Vu|  +  AV^  /  /i(x,  i)  |u(x)  —  di\ 

u  Jn 


dx 


(71) 


where  A(x,  i )  is  the  histogram  count  of  bin  i  at  voxel  x,  N  the  number  of  depth  maps  and  di  distance 
for  histogram  bin  i.  The  advantages  of  the  above  variational  method  are  (a)  efficient  convex  optimization 
based  solvers  which  can  handle  millions  of  points  (b)  data-driven  approach  to  handle  noisy  point  clouds 
and  (c)  extensible  for  further  optical  flow  based  refinements. 

To  obtain  highly  accurate  large-scale  dense  3D  reconstruction  variational  -  partial  differential  equa¬ 
tion  (PDE)  approaches  can  be  utilized  [24,  117].  In  particular,  we  will  consider  variational  regularization 
methods  based  on  total  variation  (TV)  and  total  generalized  variation  (TGV)  type  functionals  to  be  used 
along  with  our  3D  reconstruction  pipeline  for  improved  discontinuity  preservation.  For  example  to  com¬ 
pute  the  depth-map  Z  between  two  images  //  and  Ir  a  general  variational  method  can  be  posed  as  an 
energy  minimization  of  the  form: 

min£(Z)  =  f  (h(x)  —  Ir(f(x:  Z(x))))2dx  +  v  f  (f)(\X7Z\)dx  (72) 

z  Jn  Jn 

where  </>(•)  is  the  regularization  function  which  can  be  designed  to  prevent  smoothing  of  discontinuities  al¬ 
lowing  sharp  depth  map  edges  to  be  preserved  and  the  function  /(•,•)  depends  on  the  camera  parameters. 
We  plan  to  extend  this  general  approach  for  two  or  more  images  as  well  as  investigate  various  regular¬ 
ization  functions  which  can  aid  in  improving  final  reconstruction  quality  and  maintaining  fine  geometric 
details.  Further,  we  plan  to  leverage  state  of  the  art  convex  optimization  methods  for  efficiently  solving 
these  minimization  problems  and  use  the  fastest  numerical  algorithms  for  the  regularization  part  of  the  3D 
reconstruction  process. 


3.10.2  GPS  Denied  Environments 

In  GPS  denied  environments  only  partial  camera  pose  metadata  will  be  available;  namely  the  latitude  and 
longitude  location  information  of  the  camera  and  platform  will  not  be  available  or  only  sporadically  avail¬ 
able.  We  will  investigate  a  few  approaches  for  bundle  adjustment  algorithms  with  partial  information  in 
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the  context  of  Simultaneous  Localization  and  Mapping  (SLAM).  SLAM  is  a  core  requirement  to  enable 
navigation,  targeting  and  mission  execution  in  adverse  A2AD  (anti-access  or  access  denied)  environments. 
SLAM  is  a  fundamental  requirement  in  mobile  robotics  for  the  autonomous  platform  to  be  aware  of  its 
environment,  and  deals  with  building  a  geospatial  map  of  the  environment  while  simultaneously  determin¬ 
ing  the  pose  of  the  robot  or  platform  with  respect  to  the  same  map.  The  assumption  is  that  the  environment 
has  a  set  of  stationary  features  or  landmarks  which  can  be  observed  by  sensors  from  the  robotic  platform. 
For  GPS-denied  SLAM  we  propose  to  use  a  sparse  local  bundle  adjustment  approach  using  orientation 
only  noisy  IMU  sensor  measurements,  a  partial  a  priori  or  sensed  3D  scene  model  (that  could  be  stored 
on-board)  and  an  incremental  light  bundle  adjustment  method  like  that  proposed  in  [102  ]. 

Over  the  past  few  decades  there  has  been  extensive  effort  in  the  robotics,  computer  vision  and  pho- 
togrammetric  communities  in  finding  a  robust  solution  for  SLAM  due  to  a  variety  of  compelling  applica¬ 
tions  including  autonomous  (air,  ground,  sea,  underwater)  vehicles  for  defense  requirements,  hazardous 
environments,  search  and  rescue,  transportation  (ie  self-driving  vehicles),  entertainment,  etc.  Many  cur¬ 
rent  techniques  for  SLAM  are  derived  from  recursive  probabilistic  Bayes  methods  the  most  popular  of 
which  is  the  Kalman  Filter  (KF)  in  which  posteriors  are  represented  by  Gaussians  (Se,  2002).  There 
are  several  variations  of  KF  to  deal  with  non-linear  system  dynamics  such  as  the  Extended  Kalman  Fil¬ 
ter  (EKF),  Unscented  Kalman  Filter  (UKF),  Covariance  Intersection,  Information  Filtering  etc.  Another 
technique  used  in  the  FastSLAM  implementation  is  Particle  Filtering  (PF)  which  is  a  recursive  Bayesian 
Liter  with  Monte  Carlo  sampling  of  the  estimated  probability  density  function  [142].  PF  implements  Se¬ 
quential  Monte-Carlo  (SMC)  estimation  using  a  set  of  random  point  clusters  (particles )  representing  the 
Bayesian  posterior.  Expectation  Maximization  (EM)  is  another  method  used  in  SLAM.  It  is  a  statistical 
algorithm  based  on  Maximum  Likelihood  (ML)  estimation  which  is  ideal  for  map  construction  but  not  for 
localization  and  is  normally  used  when  the  robot  pose  is  known  within  a  given  map.  The  most  likely  map 
estimate  is  then  updated  given  the  pose  and  initial  map.  As  the  platform  moves,  EM  builds  the  map  while 
increasingly  improving  the  result.  EM  is  normally  used  when  the  mapping  problem  is  decoupled  from 
localization.  In  such  a  situation  it  can  be  combined  with  a  filtering  method  such  as  PF  for  localization. 
EM  is  advantageous  to  KF  since  it  handles  the  data  association  or  correspondence  problem. 

Full  SLAM:  The  current  state-of-the-art  approaches  are  based  on  pose  graphs  using  Dynamic  Bayesian 
Networks  and  global  bundle  adjustment  (BA)  [204]  combining  vision-based  electro-optical,  range-bearing, 
bearing-only  or  range-only  sensor  measurements.  Full-SLAM  for  autonomous  robotic  systems  is  feasi¬ 
ble  if  the  history  of  observations  is  maintained  for  a  chunk  of  time  to  enable  global  optimization  at  the 
cost  of  additional  computational  resources.  Full-SLAM  refers  to  solving  a  non-linear  optimization  prob¬ 
lem  using  the  entire  history  of  observations  to  provide  a  SLAM  estimate  for  the  autonomous  platform 
rather  than  only  the  current  state;  it  produces  better  results  than  filtering  approaches  and  can  handle  non- 
Gaussian  noise  processes.  In  full-SLAM,  the  platform  trajectory  and  map  structure  are  jointly  optimized 
given  all  measurements  up  to  the  current  time.  It  is  analogous  to  the  BA  approach  widely  used  in  the 
computer  vision  community  for  multiview  3D  reconstruction.  BA  (full-SLAM)  is  an  improved  version 
of  the  Gauss-Newton  Method  using  the  Levenberg-Marquardt  (LM)  approach  to  optimize  the  parameters 
given  some  constraints  such  as  image  projection  errors  of  3D  scene  points  [107].  Compared  to  KF-based 
SLAM,  BA  provides  more  accurate  results,  but  speed  and  computational  complexity  are  issues  in  using 
full-SLAM  for  realtime  applications.  Recent  approaches  to  improve  performance  include  the  key-frame 
technique  in  adaptive  relative  bundle  adjustment  to  increase  the  speed  of  full-SLAM  [191].  In  [  91,  146] 
a  method  called  Local  Bundle  Adjustment  (LBA)  is  proposed  which  is  able  to  perform  BA  over  partial 
clusters  of  cameras  using  an  intelligent  method  of  choosing  key-frames  based  on  the  estimated  measure¬ 
ment  uncertainties.  Relative  Bundle  Adjustment  (RBA)  is  another  approach  [191].  FrameSLAM  [113] 
was  proposed  to  match  visual  frames  with  large  numbers  of  point  features,  using  classic  BA  techniques, 
but  keep  only  relative  frame  pose  information  (a  skeleton).  The  skeleton  is  a  reduced  nonlinear  system 
that  is  a  faithful  approximation  of  the  larger  system,  and  can  be  used  to  solve  large  loop  closures  quickly, 
as  well  as  forming  a  backbone  for  data  association  and  local  registration. 

GPS-denied  SLAM:  SLAM  in  GPS -denied  situations  for  aerial  platforms  in  unknown  environments 
without  a  priori  information  is  required  for  robust  autonomous  operation  and  mission  success.  GPS-denied 
SLAM  has  application  for  operations  in  dense,  urban  environments  with  weak  GPS  signals,  interference, 
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conserving  power,  etc.  [53].  In  the  absence  of  GPS,  vision-based  electro-optical  and  infrared  cameras 
and  IMU  sensors  can  be  used  for  UAVs  [9<  ].  In  [  23],  visual  and  inertial  information  are  fused  using  a 
non-linear  optimization  approach.  They  integrate  an  IMU  error  term  with  landmark  reprojection  errors  in 
a  probabilistic  manner,  resulting  in  a  joint  non-linear  cost  function  that  is  optimized.  Using  the  concept  of 
key-frames ,  old  states  in  the  system  are  marginalized  to  maintain  a  bounded-sized  optimization  window, 
ensuring  real-time  operation.  We  will  develop  incremental  localization  estimation  using  local  bundle  ad¬ 
justment  based  on  a  small  cluster  of  views  and  using  a  priori  (partial  or  complete)  3D  models  or  landmarks 
that  have  been  measured  beforehand. 

3.10.3  Alternative  Flightpath  Trajectories 

In  addition  to  circular  orbits  for  which  the  current  3D  pipeline  and  suite  of  algorithms  has  been  developed, 
we  will  also  evaluate  the  feasibility  of  3D  reconstruction  under  different  sampling  conditions  as  well 
as  more  general  non-circular  flightpaths.  Such  flexibility  is  essential  for  several  reasons.  One  is  wider 
coverage  over  a  shorter  period  of  time  for  large  area  mapping.  Fully  circular  orbits  are  expensive  in  terms 
of  platform  time  and  require  additional  time  to  maneuver  to  the  next  circular  flightpath  for  coverage  of 
adjacent  geospatial  regions  on  the  ground.  Furthermore,  circular  flightpaths  do  not  offer  flexibility  under 
hazardous  adversarial  situations.  We  will  look  into  the  feasibility  of  using  several  flightpath  scanning 
patterns  for  3D  reconstruction  including  grid  patterns,  snake-like  zig  zag  back  and  forth  scanning  with 
simultaneous  gimbal  tracking  for  several  ground  points.  The  second  is  to  identify  the  technology  gaps  for 
3D  reconstruction  in  A2AD  environments  where  airspace  is  not  accessible  or  hazardous. 
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Dataset  spec. 

BA4S 

BA4S 

VisualSfM 

Mavmap 

Pix4D 

Speed-up 

Speed-up 

Speed-up 

Dataset 

Image  Size 

n 

nQ 

m 

Feat.+Track 

Triangulation 

Optimization 

3 

ES 

Per  Image 

3 

eS 

Per  Image 

$  S 
S  ? 

13 

1 

Per  Image 

13 

1 

Per  Image 

£  5 
+  ST 

^  £ 

DinoRing 

640x480 

48 

7K 

2K 

5s 

~0s 

0.5s 

0m+6s 

0.12s 

16s 

0.33s 

1.8x 

NA 

NA 

NA 

NA 

NA 

NA 

Fountain-Pll 

3072x2048 

11 

52K 

17K 

3s 

~0s 

2s 

0m+05s 

0.45s 

18s 

1.64s 

3.64  x 

NA 

NA 

NA 

NA 

NA 

NA 

Four  Hills,  NM 

6048x4032 

100 

263 K 

81K 

42s 

~0s 

4s 

0m  +  46s 

0.46s 

36m+00s 

21.60s 

47.0x 

34m 

20.04s 

43. 6x 

41m 

24.7s 

53.7  x 

PIXHAWK[1  ] 

752x480 

300 

254K 

46K 

10s 

~0s 

18s 

28s 

0.09s 

NA 

NA 

NA 

19m 

3.80s 

42.2x 

NA 

NA 

NA 

Columbia,  MO 

6600x4400 

202 

656K 

116K 

87s 

~0s 

14s 

lm  +  41s 

0.50s 

NA 

NA 

NA 

60m 

17.82s 

35. 6x 

3h  +  36m® 

64.2s 

128. 3x 

ABQ215,  NM 

6600x4400 

215 

668K 

142K 

93s 

~0s 

24s 

lm  +  57s 

0.54s 

4h  +  25m 

73.95s 

107. 8x 

lh  +  07m 

18.70s 

34.6x 

3h  +  32m 

59.2s 

109. 6x 

Berkeley,  CA 

6600x4400 

220 

683K 

139K 

93s 

~0s 

15s 

lm  +  48s 

0.49s 

4h  +  40m 

76.37s 

155. 9x 

lh  +  03m 

17.80s 

36.3  x 

3h  +  52m 

63.3s 

129.2x 

Los  Angeles,  CA 

6600x4400 

351 

1,1M 

207K 

141s 

~0s 

40s 

3m  +  01s 

0.51s 

8h  +  05m 

78.29s 

153. 5x 

lh  +  43m 

17.78s 

34.9  x 

8h  +  55m® 

91.45s 

179.3X 

ABQ1071,  NM 

6600x4400 

1,071 

3.5M 

603K 

440s 

9s 

251s 

1  lm  +  40s 

0.65s 

26h  +  36m 

89.41s 

137. 6x 

5h  +  21m 

17.98s 

27.7  x 

50h  +  49mb 

170.8s 

262.8  x 

Columbia-II*,  MO 

6600x4400 

5,322 

17.4M 

2.5M 

2089s 

67s 

349s 

41m  +  45s 

0.47s 

na F 

NA 

NA 

NA 

NA 

NA 

NA^ 

NA 

NA 

Columbia-II,  MO 

6600x4400 

5,322 

17.4M 

2.5M 

2307s 

68s 

916s 

54m  +  51s 

0.62s 

M v 

NA 

NA 

NA 

NA 

NA 

NAb 

NA 

NA 

A' 

d? 

G^ 

A* 

C? 

G^ 

A^ 

V* 

A 

d? 

A 

A 

d? 

A 

A 

A* 

G^ 

A 

A 

G^ 

A 

A 

(a:  Failure  in  its  first  attempt  and  had  to  be  repeated,  b:  Failed  to  generate  result.) 


Table  2:  Dataset  characteristics  and  timings  for  individual  processing  steps  (per  image)  and  the  full  BA4S  pipeline.  We  also  include  timing 
comparison  with  VisualSfM  [217]  and  two  aerial  imagery  SfM  algorithms,  Mavmap[184]  and  Pix4D[7].  Columns  n  (Col  3),  n0  (Col  4)  and  m 
(Col  5)  indicate  number  of  cameras/images,  number  of  2D  observation  points  and  number  of  3D  points  (feature  tracks),  respectively.  The  time 
for  each  stage  of  BA4S  processing  including  feature  extraction  and  tracking,  triangulation  and  optimization  is  shown.  All  of  the  experiments  in 
Table  2  include  optimization  for  extrinsic  parameters  (rotation  and  translation)  of  every  single  camera  (6xn  parameters),  all  3D  points  (3  x  m 
parameters)  and  one  shared  focal  length.  Only  "Columbia-II*"  does  not  include  focal  length  optimization.  The  total  time  taken  for  BA4S, 
VisualSFM,  Mavmap  and  Pix4D  are  shown  along  with  per  image  timings.  On  average  BA4S  (on  WAMI  datasets)  is  nearly  130  times  faster  than 
VisualSfM  (Col  13),  over  35  times  faster  than  Mavmap  (Col  16)  and  about  274  times  faster  than  Pix4D  (Col  19). 

4  Results  and  discussion 

4.1  BA4S 

In  this  section  our  developed  BA4S  SfM  pipeline  is  evaluated.  We  provide  a  set  of  quantitative  and 
qualitative  assessments  to  evaluate  the  accuracy,  speed  and  effectiveness  of  the  proposed  approach. 


4.1.1  Implementation 

The  BA4S  pipeline  was  implemented  in  C++.  The  computer  used  for  experiments  was  a  server  with  CPU 
Intel  Xeon  5650,  2.66  GHz,  12  cores,  24  GB  RAM  and  nVidia  GTX480/1.5GB  GPU.  SIFT-GPU  [214] 
was  used  for  fast  feature  extraction.  The  Ceres  Solver  library  [9]  was  used  for  non-linear  least-squares 
estimation;  Schur’s  complement,  Cholesky  factorization  and  Levenberg-Marquardt  algorithms  were  used 
for  trust  region  step  computation. 


4.1.2  Datasets 

We  used  a  collection  of  real  aerial  WAMI  datasets,  most  of  them  collected  (by  Transparent  Sky  [5]) 
using  a  fixed  wing  aircraft  with  on-board  pose  sensors  flying  over  several  different  urban  areas  including 
downtown  Albuquerque,  NM,  Four  Hills,  NM,  Columbia,  MO,  Los  Angeles,  CA  and  Berkeley,  CA.  Fig. 
38  shows  a  few  sample  frames  from  some  of  these  datasets.  The  characteristics  of  the  sequential  and 
WAMI  test  datasets  along  with  timing  results  are  given  in  Table  2.  Each  dataset  includes  platform-based 
camera  orientation  matrices  and  translation  vectors  provided  by  imprecise  IMU  and  GPS  measurements 
along  with  intrinsic  camera  parameters  that  we  refer  to  as  metadata. 

In  addition  to  sequential  aerial  imagery  datasets,  the  BA4S  pipeline  has  been  tested  on  several  publicly 
available  vision  benchmark  datasets  with  multiview  imagery  acquired  in  a  sequential  (circular)  trajectory, 
including  DinoRing  from  the  Middlebury  MVS  evaluation  project  [  86].  As  discussed  in  [82,  13],  Di- 
noRing  is  a  challenging  dataset  for  evaluating  SfM  pipelines  since  the  object  lacks  salient  features  making 
feature  tracking  very  difficult.  Another  sequential  multiview  dataset  is  Fountain-Pll  from  EPFL  [199] 
with  eleven  images  taken  from  different  side  views  of  a  wall-attached  fountain. 
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Figure  35:  Plots  for  number  of  matches  between  camera  pairs  within  the  WAMI  datasets.  As  expected  from  a  sequential  imagery,  adjacent 
cameras  share  the  largest  number  of  matches;  the  number  of  matches  decreases  as  the  temporal  distance  between  frames  increases.  The  peak 
values  are  along  the  diagonals.  One  can  see  that  there  are  large  numbers  of  matches  between  the  first  frames  and  last  ones  in  most  of  these  WAMI 
datasets,  as  a  loop  closure  was  performed. 


(a)  ABQ215-  Left:  GPS  positions.  Right:  GPS  positions  and  IMU  directions.  (b)  LA-  Left:  GPS  positions.  Right:  GPS  positions  and  IMU  directions. 


Figure  36:  Uncorrected  camera  trajectories  prior  to  applying  BA4S  corresponding  to  the  Albuquerque  (ABQ215)  and  Los  Angeles  (LA)  aerial 
WAMI  datasets.  Note  that  the  vertical  axis  of  each  graph  uses  a  different  vertical  scale  for  the  altitude  range. 
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(a)  ABQ.  215  frames;  141,559  tracks  (b)  AlbuquerqueFull.  1071  frames; 

603,119  tracks 


Track  length  histogram.  (mean=  3.26, std=  2.03) 


Track  length  histogram.  (mean=  5.66, std=  6.E 


(c)  Columbia.  202  frames;  115,897  tracks  (d)  Four  Hills.  100  frames;  80,661  tracks 


Track  length  histogram.  (mean=  4.92, std=  5.E 


(e)  Berkeley.  220  frames;  138,743  tracks  (f)  LA.  351  frames;  207,391  tracks 
Figure  37:  Distribution  of  feature  track  lengths  in  the  WAMI  datasets  follows  an  approximately  power  law  distribution. 
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Figure  38:  Each  row  shows  a  different  WAMI  sequence  collected  by  Transparent  Sky  [5]  along  with  four  frames  well  separated  in  time.  Sub¬ 
sampled  (but  uncropped)  frames  from  the  original  6600x4400  images  are  shown  except  Four  Hills  which  has  a  frame  size  of  6048x4032.  See 
Table  2  for  details  of  each  dataset  specification. 
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(a)  ABQ215 


(b)  Berkeley 


Figure  39:  Differences  in  camera  positions  between  original  noisy  metadata  and  BA4S  corrected  position  output  for  different  aerial  WAMI 
datasets  including  ABQ,  Berkeley,  LA  and  Columbia.  The  curves  show  the  amount  of  correction  applied  to  each  camera  position  after  BA4S. 


Time  Per  Frame  for  BA4S 


Number  of  cameras 

Figure  40:  Per-frame  timing  performance  for  BA4S  corresponding  to  the  results  in  Table  2.  The  time  per  frame  is  approximately  constant  for 
1000  of  more  cameras/views  which  is  very  promising  for  large  scale  SfM  applications. 

4.1.3  Results 

We  ran  the  proposed  BA4S  pipeline  on  each  dataset  in  Table  2.  The  introduced  sequential  feature  tracking 
method  was  used  to  track  the  identified  SIFT  features  over  the  sequence  (see  Table  2  for  the  statistics).  The 
histograms  of  the  track  lengths  for  six  WAMI  aerial  dataset  are  presented  in  Fig.  37.  Plots  in  Fig.  35  show 
the  amount  of  matches  (feature  point  correspondences)  between  each  two  images/cameras  for  the  same 
six  WAMI  datasets.  Notice  that  the  feature  matching  and  track  building  were  performed  sequentially  and 
the  last  frame  in  each  dataset  was  matched  against  the  first  one  to  avoid  drift  in  the  sequences  (in  robotics 
it  is  known  as  loop  closure  [79]).  A  linear  triangulation  algorithm  [9  ]  was  used  to  estimate  and  initialize 
3D  points.  The  persistency  factors  of  the  tracks  and  their  related  statistics  were  used  as  weights  in  the 
adaptive  robust  function  for  the  BA  optimization.  Fig.  36  depicts  the  camera  positions  and  their  viewing 
directions  (optical  axes)  for  two  of  the  datasets,  ABQ215  and  LA.  Fig.  39  shows  the  amount  of  camera 
position  correction  estimated  by  BA4S  for  three  different  WAMI  datasets.  The  sensor  measurement  errors 
along  the  x-  and  y-axes  directions  for  ABQ  have  the  largest  range  while  the  Berkeley  raw  measurements 
were  the  most  accurate.  Notice  that  since  the  distance  from  the  sensor  to  the  scene  objects  is  on  the 
order  of  a  kilometer  or  more,  even  small  errors  in  the  camera  pose  could  lead  to  very  large  errors  once 
projected  onto  the  scene.  For  all  the  experiments  presented  in  Table  2,  the  optimizations  were  performed 
on  6  extrinsic  parameters  (3  rotation  and  3  translation)  for  every  single  camera  (6 n),  all  3D  points  (3m) 
and  one  shared  focal  length  (total  number  of  parameters  to  be  optimized  were  6 n  +  3m  +1).  Only  for 
"Columbia-II*",  the  focal  length  was  not  considered  for  the  optimization.  We  observed  that  when  the 
focal  length  is  not  included  as  a  parameter  to  be  optimized,  we  gain  about  %30  speed  up  in  the  BA. 
This  can  be  seen  in  Table  2  by  comparing  the  timings  in  the  "Optimization"  column  (Col  8)  between 
Columbia-II*"  and  "Columbia-II".  BA4S  performance  in  terms  of  per  frame  execution  times  for  WAMI 
aerial  datasets  is  plotted  in  Fig.  40.  The  time  per  frame  is  approximately  constant  (see  Column  10  in 
Table  2)  and  is  independent  of  the  number  of  cameras  (or  views)  which  is  a  surprising  result  compared  to 
other  SfM  methods  in  the  literature  and  also  very  promising  for  large  scale  aerial  imagery  BA,  mosaicing, 
georegistration  and  3D  reconstruction  applications.  The  timing  for  BA4S  includes  I/O  for  copying  image 
frames  from  storage  to  memory  which  we  found  to  be  a  significant  percentage  of  the  Feature  Extraction 
and  Tracking  time  (Column  6  in  Table  2). 
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Figure  41:  Reprojection  error  (RMSE)  and  convergence  in  BA4S.  Legend  notation:  The  first  term  in  the  legend  is  the  dataset  name.  ’LC’  means 
that  a  ’loop  closure’  was  applied.  ’F’  is  for  the  case  with  shared  focal  length  of  the  cameras  that  is  optimized  using  an  initial  value  of  17,000 
pixels  which  is  close  to  the  true  value  of  17,651  pixels.  ’FF’  is  similar  to  the  ’F’  case,  but  using  an  initial  value  of  10,000  which  is  considerably 
far  from  the  actual  value  (more  than  50%  deviation).  As  shown  in  the  left  plot,  BA4S  managed  to  recover  and  reduce  the  RMSE  error  to  less 
than  one  pixel,  even  when  the  focal  length  was  initialized  to  almost  half  of  its  actual  value. 


Convergence:  Plots  in  Fig.  41  show  the  reprojection  error  and  convergence  of  BA4S  on  a  couple 
of  datasets.  The  reprojection  errors  (RMSE)  are  shown  in  Fig.  41-left.  The  convergence  plot  or  the 
amount  of  change  in  RMSE  between  each  iteration  is  plotted  in  Fig.  41 -middle  and  the  consumed  time 
per  iteration  is  shown  in  Fig.  41 -right.  In  some  of  these  experiments  a  loop  closure  was  performed 
(indicated  by  ’LC’  in  the  captions).  In  this  part  of  our  evaluation  (plots  of  Fig.  41),  all  of  the  experiments 
include  optimization  for  extrinsic  parameters  (rotation  and  translation)  of  every  single  camera  (6  x  n 
parameters)  and  all  3D  points  (3  x  m  parameters).  In  some  of  these  experiments  the  camera’s  focal  length 
(as  an  intrinsic  parameter)  was  also  optimized  where  the  initial  value  was  set  to  17,000  pixels,  while 
the  correct  focal  length  was  17,650  pixels.  Those  experiments  are  indicated  by  ’F’  in  their  captions  in 
Fig.  41.  In  most  of  the  cases,  RMSE  was  significantly  reduced  to  less  than  a  pixel  right  after  the  first 
iteration  in  the  non-linear  optimization.  For  two  datasets  the  focal  length  was  set  to  10,000  pixels  which 
is  far  away  from  its  actual  value  of  17,650  pixels  (about  57%  off).  They  are  indicated  by  ’FF’  in  their 
captions.  As  one  can  see  in  Fig.  41 -left,  even  when  the  focal  length  was  initialized  to  almost  half  of 
its  correct  value,  BA4S  managed  to  recover  and  reduce  the  RMSE  error  to  less  than  one  pixel  in  its  fifth 
iteration.  This  indicates  that  BA4S  is  not  sensitive  to  the  initial  focal  length  and  is  able  to  recover  in 
a  few  iterations.  Consequently,  the  need  for  applying  a  camera  calibration  procedure  (before  BA4S)  is 
relaxed.  We  compared  the  results  of  our  BA4S  with  parallel  VisualSfM/Bundler  [217],  a  state-of-the-art 
SfM  implementation,  for  some  sample  datasets;  For  VisualSfM  we  provided  imagery  without  including 
metadata,  which  is  not  a  fully  supported  feature.  VisualSFM  uses  two  strategies  for  matching  including 
preemptive  and  exhaustive  [  15].  With  preemptive  matching  VisualSFM  generated  several  fragments  of 
cameras  and  only  a  fraction  of  cameras  could  be  recovered  while  for  other  cameras  it  failed.  This  is 
consistent  with  similar  observations  about  VisualSFM’s  limited  performance  on  sequential  aerial  images 
[184].  We  ran  VisualSFM  in  its  exhaustive  and  expensive  matching  mode  in  order  to  recover  all  cameras. 
We  compared  our  algorithm  versus  Mavmap[184]  as  a  recent  SfM  algorithm  for  sequential  aerial  images. 
Like  our  method,  Mavmap  also  takes  advantage  of  temporal  consistency  and  availability  of  metadata.  The 
camera  poses  in  Mavmap  are  initialized  by  estimating  the  essential  matrix  [152]  and  applying  a  RANSAC 
technique  to  deal  with  large  amounts  of  outliers.  In  addition,  we  evaluated  our  method  against  Pix4D  [7] 
which  is  a  commercial  SfM  and  aerial  mapping  system.  In  these  experiments,  our  BA4S  algorithm  is  (on 
WAMI  aerial  imagery)  in  average  130  times  faster  than  VisualSfM,  35  times  faster  than  Mavmap,  and  274 
times  faster  than  Pix4D. 

We  also  tried  MapTK  [122],  a  recent  open-source  SfM  algorithm  designed  specifically  for  aerial  im¬ 
agery,  on  some  of  our  WAMI  datasets  including  ABQ215,  Berkeley  and  ABQ1071.  Its  feature  tracking 
(combined  with  parameter  estimation  and  triangulation)  part  ran  on  all  of  these  three  datasets,  however 
for  Berkeley  and  ABQ1071  its  optimization  algorithm  did  not  terminate  after  entering  the  BA  part  and  we 
had  to  stop  it  after  running  for  a  long  time.  In  its  successful  run  on  ABQ215,  it  took  about  22m+14s  for 
the  feature  extraction,  parameter  estimation  and  triangulation  and  7m+17s  for  the  optimization  (in  total, 
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Figure  42:  Evaluation  of  camera  parameters  using  EEE  measure  before  (a  and  c)  and  after  BA  (b  and  d)  for  the  ABQ215  (a  and  b)  and  Berkeley 
(c  and  d)  aerial  WAMI  datasets.  The  pel  in  each  matrix,  eqi ,  indicates  the  error  between  the  q- th  camera  (matrix  rows)  and  l- th  camera  (columns), 
computed  using  (35).  The  eqi  pel  values  were  truncated  to  the  maximum  values  50  and  5  respectively.  The  mean  and  standard  deviation  of  the 
errors  in  pixels  (36),  over  all  cameras  are  shown  below  each  plot.  Notice  that  each  plot  uses  a  different  scale  for  better  visualization  of  errors. 
Color  bars  shown  to  the  right. 


29m+31s)  compared  to  our  BA4S  which  took  lm+33s  for  feature  extraction  and  tracking,  and  24s  for 
BA  optimization  (in  total,  lm+57s)  on  the  same  dataset.  As  one  can  see,  despite  the  fact  that  it  is  very 
recent  and  specifically  designed  for  aerial  imagery,  and  while  being  the  fastest  SfM  (compared  to  other 
existing  approaches),  MapTK  is  still  15x  slower  than  our  proposed  BA4S  on  the  only  dataset  for  which 
it  was  able  to  run  without  failing.  We  believe  that  MapTK  like  other  SfM  approaches  is  not  able  to  truly 
use  the  available  metadata  (although  noisy)  towards  achieving  higher  speed  and  robustness.  It  is  due  to 
the  fact  that  the  pipeline  still  uses  RANSAC  and  iterative  random  based  parameter  estimations  (see  the 
conventional  pipeline  in  Fig.  3.) 

The  EEE  evaluation  metric  previously  explained  was  applied  to  both  the  ABQ215  and  Berkeley 
datasets  and  the  symmetric  error  matrix  e/m  in  (35)  is  shown  as  colored  pels  in  Fig.  42.  Plots  (a)  and 
(c)  show  the  EEE  measure  of  the  camera  parameters  using  metadata  (uncorrected  platform  camera  pa¬ 
rameters);  the  range  of  errors  is  truncated  to  50  pixels.  The  initial  raw  metadata  is  very  noisy  but  after 
refinement  using  the  proposed  BA4S  pipeline  there  is  significant  improvement  in  quality  as  shown  in  plots 
(b)  and  (d);  notice  the  range  of  errors  is  truncated  to  3  pixels  in  this  case.  The  EEE  jie  and  cr€  statistics  us¬ 
ing  all  the  cameras  (see  Equation  (36))  are  also  given.  BA4S  was  quite  successful  in  refining  the  metadata 
while  using  significantly  less  time  (Table  2).  For  MapTK,  the  EEE  (error)  pairwise  plot  and  also  its  his¬ 
togram  are  shown  in  Fig.  43.  The  mean  and  standard  deviation  of  EEE  corresponding  to  this  experiment 
are  2.31  and  2  pixels,  respectively. 

Fig.  44  shows  the  EEE  graphically  to  assess  camera  parameters  for  ABQ215  (first  two  rows)  and 
Berkeley  (last  row)  datasets.  Point  correspondence  #2  in  the  ground- truth  between  the  50th  and  150th 
cameras  within  the  sequence  were  used.  The  epipolar  lines  corresponding  to  image  #50  in  each  dataset 
is  computed  using  the  camera  parameters  and  plotted  on  image  #150.  The  left  column  shows  original 
raw  metadata  (unrefined).  The  middle  column  shows  the  epipolar  lines  after  the  metadata  were  refined 
using  the  BA4S  pipeline.  The  epipolar  lines  should  ideally  pass  through  the  ground-truth  points  (center 
of  the  marked  circles  in  each  plot).  As  can  be  seen,  the  noise  in  metadata  is  significantly  reduced  after 
applying  BA4S.  The  errors  values  in  these  plots  are  consistent  with  the  EEE  values  in  Fig.  42;  see  the 
pair  (#50, #150)  as  well  as  the  pair  (#1,#158)  in  the  matrix.  The  epipolar  lines  in  the  right  column  were 
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Figure  43:  Evaluation  of  camera  parameters  using  EEE  measure  after  using  MapTK  for  the  ABQ215  aerial  WAMI  dataset.  Left:  EEE  (error) 
plot.  The  pel  in  the  matrix,  eqi,  indicates  the  error  between  the  q- th  camera  (matrix  rows)  and  l- th  camera  (columns),  computed  using  (35).  The 
eqi  pel  values  were  truncated  to  the  maximum  value  of  3.  The  mean  and  standard  deviation  of  the  errors  in  pixels  (36),  over  all  cameras  are 
lie  =  2.31  and  ae  =  2.  Right:  Histogram  of  the  corresponding  error  values. 


(a)  ABQ215,  frames  (#50, #150).  Left:  uncorrected  metadata.  Middle:  metadata  corrected  by  BA4S.  Right:  VisualSfM 


(b)  ABQ215,  frames  (#1,#158).  Left:  uncorrected  metadata.  Middle:  metadata  corrected  by  BA4S.  Right:  VisualSfM 


(c)  Berkeley,  frames  (#50, #150).  Left:  uncorrected  metadata.  Middle:  metadata  corrected  by  BA4S.  Right:  VisualSfM 


Figure  44:  Visual  assessment  of  camera  parameters  using  epipolar  lines  using  uncorrected  metadata,  refined  metadata  by  BA4S  and  VisualSfM. 
Some  ground-truth  points  between  two  pairs  of  cameras,  (#50, #150)  and  (#1,#158),  in  the  ABQ215  WAMI  dataset  and  a  pair  of  cameras 
(#50,#  150)  in  Berkeley  WAMI  dataset  were  used.  The  shown  images  correspond  to  the  left  camera  in  each  pair.  On  each  shown  image  the 
corresponding  ground-truth  point  is  indicated  by  red  circle.  Epipolar  lines  corresponding  to  the  ground-truth  point  from  the  right  camera  in  each 
pair  are  directly  calculated  using  camera  parameters  (using  Eq.  (28))  and  plotted  on  the  image  of  the  left  camera  in  each  pair  (shown  images).  The 
camera  parameters  from  three  different  sources  were  used  in  each  column;  left:  uncorrected  original  metadata,  middle:  BA4S  (refined  metadata) 
and  right:  VisualSFM. 
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(a)  ABQ215  dataset.  Left:  uncorrected  metadata  with  Euclidean  error  =  16.75  pixels.  Right:  Refined  metadta  by  BA4S  with  Euclidean  error  = 
0.23  pixels. 


(b)  Berkeley  dataset.  Left:  uncorrected  metadata  with  Euclidean  error  =  20.63  pixels.  Right:  Refined  metadta  by  BA4S  with  Euclidean  error  = 
0.14  pixels. 


Figure  45:  Visual  assessment  of  camera  parameters  using  epipolar  lines  using  the  metadata  before  and  after  refinement  by  BA4S.  The  left  and 
right  (cropped)  image  in  each  row  correspond  to  the  camera  parameters  assessments  using  original  metadata  (uncorrected)  and  BA4S  optimized 
ones,  respectively.  While  Fig.  44  depicted  the  epipolar  lines  for  the  point  correspondences  between  just  two  views,  each  row  in  the  current 
figure  represents  a  similar  assessment  however  between  one  camera  and  all  other  cameras  within  the  dataset.  The  groundtruth  point  is  marked 
in  yellow.  Instead  of  drawing  all  epipolar  lines  over  an  image,  an  analogue  point  for  each  epipolar  line  is  plotted  (colored  in  red).  We  obtain  the 
line  segment  which  joins  the  ground  truth  point  to  the  epipolar  line  and  is  perpendicular  to  the  line.  Each  of  such  a  line  segment  is  used  as  an 
error  vector.  The  foot  of  the  perpendicular  corresponding  to  each  epipolar  is  plotted  as  red  points.  The  mean  and  covariance  matrix  of  the  error 
vectors  are  computed  and  the  values  (in  pixels)  are  printed  over  the  top  of  each  image  in  the  figure.  The  covariance  matrix  and  mean  are  also 
plotted  as  an  ellipse  and  point  (in  blue),  respectively.  In  these  two  samples,  we  observed  reductions  in  the  errors  by  factors  of  more  than  70  and 
140  times  respectively  for  Albuquerque  and  Berkeley  datasets. 
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(a)  VSFM-  Euclidean  Error  =  2.02  pixel. 


(b)  BA4S-  Euclidean  Error  =  0.82  pixel. 


Figure  46:  Visual  assessment  of  camera  parameters  using  epipolar  lines  using  the  corrected  metadata  by  BA4S  and  VSFM.  Visual  assessment 
of  camera  parameters  using  epipolar  ellipses  in  ABQ215  dataset  (frame  #215).  Left:  VSFM,  right:  BA4S.  While  Fig.  44  depicted  the  epipolar 
lines  for  the  point  correspondences  between  just  two  views,  the  current  figure  represents  a  similar  assessment  however  between  one  camera  and 
all  other  cameras  within  the  dataset.  The  ground  truth  point  is  marked  in  yellow.  Instead  of  drawing  all  epipolar  lines  over  an  image,  an  analogue 
point  for  each  epipolar  line  is  plotted  (colored  in  red).  We  obtain  the  line  segment  which  joins  the  ground  truth  point  to  the  epipolar  line  and  is 
perpendicular  to  the  line.  Each  of  such  a  line  segment  is  used  as  an  error  vector.  The  foot  of  the  perpendicular  corresponding  to  each  epipolar 
is  plotted  as  red  points.  The  mean  and  covariance  matrix  of  the  error  vectors  are  computed  and  the  values  (in  pixels)  are  printed  over  the  top  of 
each  image  in  the  figure.  The  covariance  matrix  and  mean  are  also  plotted  as  an  ellipse  and  point  (in  blue),  respectively.  The  Euclidean  error 
corresponding  to  this  sample  is  2.02  and  0.82  pixels  for  VSFM  and  BA4S  respectively.  Notice  that  the  red  error  points  are  all  scattered  very  close 
to  the  ground  truth  (yellow  point)  and  therefore  may  not  be  visible,  specially  for  the  BA4S  result  as  the  error  is  very  small. 

plotted  using  the  camera  parameters  estimated  by  VisualSFM  for  comparison.  The  epipolar  line  for  a 
ground  truth  point  between  a  pair  of  cameras  (#1,#158)  is  drawn.  Similar  plots  are  presented  in  Fig. 
47  for  comparison  between  the  original  metadata,  and  the  corrected  parameters  by  BA4S  and  MapTK. 
Another  visual  assessment  is  presented  in  Fig.  45.  While  Fig.  44  depicted  the  epipolar  lines  for  the 
point  correspondences  between  just  two  views,  each  row  in  Fig.  45  presents  a  similar  assessment  however 
between  one  camera  and  all  other  cameras  in  the  dataset.  The  ground  truth  point  is  marked  in  yellow. 
Instead  of  drawing  all  epipolar  lines  over  one  image  which  would  be  very  confusing,  we  have  chosen  to 
plot  a  representative  point  for  each  epipolar  line  (colored  in  red).  We  obtain  the  perpendicular  line  segment 
on  each  epipolar  line  which  passes  through  the  corresponding  ground  truth  image  point.  The  foot  of  such  a 
perpendicular  line  on  each  epipolar  line  is  used  as  a  representative  for  the  error  between  the  two  underlying 
cameras.  Indeed,  the  length  of  such  perpendicular  line  segment  (from  the  foot  to  the  ground  truth  point) 
was  already  used  to  compute  eqi  in  (35).  In  figure  45,  the  foot  of  the  perpendicular  corresponding  to  each 
epipolar  is  plotted  in  red.  In  other  words,  the  yellow  point  shows  the  ground  truth  point  in  the  current 
frame  (view/camera)  and  each  red  point  indicates  the  error  between  the  current  frame  (view/camera)  and 
another  frame  (view/camera)  in  the  sequence.  The  mean  and  covariance  matrix  of  all  errors  are  computed 
and  the  values  (in  pixels)  are  printed  on  the  top  of  each  image  in  the  figure.  The  blue  ellipse  plotted  on 
each  image  visualizes  the  corresponding  covariance  matrix,  where  the  ellipse  center  is  shown  as  a  blue  dot. 
The  first  and  second  rows  of  Fig.  45  correspond  to  an  exemplary  frame  in  ABQ215  and  Berkeley  datasets, 
respectively.  Left  images  are  for  the  cases  where  the  original  camera  parameters  (uncorrected  metadata) 
were  used  and  the  right  images  are  for  the  cases  of  using  optimized  camera  parameters  by  BA4S.  As 
one  can  see,  the  center  of  the  covariance  ellipse  (blue  point)  for  the  BA4S  ones  nearly  coincides  with  the 
ground  truth  point  (yellow  point)  and  the  distribution  of  the  errors  is  such  trivial  that  the  ellipse  almost 
appears  as  a  point,  in  contrary  to  the  left  images  (original  metadata  ones).  Similar  plots  can  be  seen  in 
Fig.  46  for  another  frame  (#214)  for  VSFM  (left)  and  BA4S  (right).  The  Euclidean  error  corresponding 
to  this  sample  is  2.02  for  VSFM  and  and  0.82  pixels  for  BA4S.  Notice  that  the  red  error  points  are  all 
scattered  very  close  to  the  ground  truth  (yellow  point)  and  therefore  may  not  be  visible,  specially  for  the 
BA4S  result  as  the  error  is  very  small. 

Usually  a  dense  3D  reconstruction  algorithm  such  as  PMVS[84]  is  applied  after  BA  in  order  to  obtain 
a  dense  and  colored  point  cloud.  We  also  applied  PMVS  for  some  of  the  datasets  to  visually  assess 
reconstructed  point  clouds.  The  optimized  metadata  from  BA4S  is  used  as  input  to  PMVS  (or  CMVS). 


69 

APPROVED  FOR  PUBLIC  RELEASE;  DISTRIBUTION  UNLIMITED 


(a)  Uncorrected  metadata 


(b)  Corrected  by  BA4S 


(c)  Corrected  by  MapTK  (less  accurate) 


Figure  47:  Visual  assessment  of  camera  parameters  using  epipolar  lines  using  uncorrected  metadata,  refined  metadata  by  BA4S  and  MapTK. 
A  ground-truth  point  between  a  pair  of  cameras  (#1,#158)  in  the  ABQ215  WAMI  dataset  is  used.  The  shown  images  correspond  to  the  left 
camera  in  each  pair.  On  each  shown  image  the  corresponding  ground-truth  point  is  indicated  by  red  circle.  Epipolar  lines  corresponding  to  the 
ground-truth  point  from  the  right  camera  in  each  pair  are  directly  calculated  using  camera  parameters  (using  Eq.  (28))  and  plotted  on  the  image 
of  the  left  camera  in  each  pair  (shown  images).  The  camera  parameters  from  three  different  sources  were  used  in  each  column;  left:  uncorrected 
original  metadata,  middle:  BA4S  (refined  metadata)  and  right:  MapTK  [122].  As  one  can  see  there  are  some  pixel  errors  between  the  ground 
truth  point  and  corresponding  epipolar  line  computed  from  MapTK  result,  which  is  consistent  with  the  errors  in  plot  43 -left  in  the  region  for  the 
corresponding  pair  (between  the  pair  #1,#158) 

Figs  48-a,  -b,  -c  and  -d  show  the  PMVS  dense  point  clouds  for  ABQ215,  Los  Angeles,  Berkeley  and 
Columbia,  respectively. 

Non- WAMI  datasets:  In  addition  to  testing  BA4S  on  aerial  WAMI  datasets,  we  have  applied  it  to  the 
Middlebury  benchmark  datasets  for  multiview  3D  reconstruction  which  are  not  WAMI  but  the  images  are 
acquired  sequentially.  DinoRing  [186]  is  one  of  the  challenging  datasets  for  a  classical  BA  due  to  the  lack 
of  salient  image  features  for  tracking  across  views  [82,  1  ].  The  proposed  BA4S  pipeline  was  tested  on 
this  dataset  to  evaluate  its  applicability  and  performance  for  non- WAMI  trajectory  images.  The  camera 
parameters  in  the  Middlebury  ground-truth  were  synthetically  perturbed  for  both  rotation  and  translation. 
The  perturbed  camera  parameters  along  with  the  images  were  input  to  the  BA4S  pipeline.  The  DinoRing 
dataset  has  48  cameras  with  image  resolution  of  640  x  480  pixels.  The  position  errors  for  the  metadata 
(perturbed  camera  parameters  before  optimization)  and  the  optimized  ones  by  BA4S  are  plotted  in  Fig. 
49a.  The  first  three  rows  in  Fig.  50  show  the  visual  assessment  of  three  point  correspondences.  The  errors 
for  the  corresponding  epipolar  lines  are  significantly  reduced  after  BA4S  optimized  camera  parameters. 
The  epipolar  lines  have  very  large  errors  (the  first  and  second  columns  of  each  row)  when  the  noisy  camera 
parameters  are  used.  A  dense  version  of  the  point  cloud  using  PMVS  with  BA4S  optimized  camera 
parameters  is  shown  in  Fig.  51a.  FountainPll  [199]  is  another  non- WAMI  dataset.  As  with  the  DinoRing 
dataset  the  ground-truth  camera  parameters  were  perturbed  prior  to  running  the  BA4S  algorithm.  There 
are  1 1  cameras  and  each  image  has  a  resolution  of  3072  x  2048  pixels.  The  position  errors  for  the  metadata 
(perturbed  camera  parameters)  and  the  refined  results  using  BA4S  are  plotted  in  Fig.  49b.  The  fourth,  fifth 
and  sixth  rows  in  Fig.  50,  each  shows  a  point  correspondences  in  this  dataset.  The  initial  epipolar  lines 
have  very  large  errors  in  the  views  (the  first  and  second  columns  of  each  row)  when  the  perturbed  camera 
parameters  were  used.  The  errors  for  the  corresponding  epipolar  lines  become  significantly  smaller  once 
BA4S  refined  camera  parameters  are  used.  A  dense  version  of  the  point  cloud  for  FountainPll  using 
PMVS  and  optimized  camera  parameters  is  shown  in  Fig.  51b. 

4.2  Georegistration  and  stabilization 

Refined  camera  parameters  by  BA4S  were  used  to  compute  homography  transformation  matrices  from 
each  camera  image  plane  onto  the  ground  plane.  Notice  that  such  homographies  were  directly  (analyti¬ 
cally)  computed  using  (25),  as  opposed  to  applying  an  image-to-image  homography  estimation  method. 
Fig.  52  shows  cropped  areas  of  two  registered  images.  Images  of  the  first  row  correspond  to  a  case 
where  the  original  uncorrected  metadata  were  used  for  registration.  The  second  row  shows  the  same 
frames,  however  registered  by  corrected  metadata  using  the  proposed  pipeline.  For  each  case,  a  few  cor¬ 
responding  points  (lying  in  the  ground  plane)  between  the  two  registered  frames  were  manually  picked 
and  marked.  As  shown,  the  RMS  error  for  the  case  where  the  uncorrected  metadata  were  used  is  about  70 
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(a)  Albuquerque,  NM  downtown 


(b)  Los  Angeles,  CA 


(c)  Berkeley,  CA  (d)  Columbia,  MO 


Figure  48:  Dense  3D  point  clouds  for  four  aerial  WAMI  sequences  using  CMVS  [i  ]  with  BA4S  corrected  camera  pose  metadata.  Source 
imagery  from  Transparent  Sky. 
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Figure  49:  Position  errors  before  (red  curve)  and  after  (green  curve)  optimization  using  BA4S  to  refine  the  metadata  for  DinoRing  (top)  and 
Fountain-Pll  (bottom)  datasets. 
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(a)  Metadata  (b)  Metadata-zoomed  (c)  After  BA4S  (d)  BA4S-zoomed 


(e)  Metadata  (f)  Metadata-zoomed  (g)  After  BA4S  (h)  BA4S-zoomed 


(i)  Metadata  (j)  Metadata-zoomed  (k)  After  BA4S  (1)  BA4S-zoomed 


(m)  Metadata  (n)  Metadata-zoomed  (o)  After  BA4S  (p)  BA4S-zoomed 


(q)  Metadata  (r)  Metadata-zoomed  (s)  After  BA4S  (t)  BA4S-zoomed 


Figure  50:  Visual  assessment  of  camera  parameters  using  epipolar  lines  in  DinoRing  (first  three  rows)  and  Fountain-Pi  1  (last  three  rows)  datasets. 
Each  row  shows  the  results  for  one  correspondence.  First  and  second  columns  are  using  the  original  (perturbed)  camera  metadata  parameters. 
Notice  that  in  (b),  (f),  (n),  (r)  and  (v),  the  epipolar  line  went  off  the  image  plane  due  to  large  metadata  error.  Third  and  fourth  columns  of  each 
row  show  the  full  and  zoomed  views  using  camera  parameters  after  BA4S. 
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Figure  5 1 :  Dense  reconstruction  using  PM  VS  after  optimized  camera  parameters  are  estimated  using  BA4S.  Left:  DinoRing,  right:  FountainPl  1 . 


pixels.  However,  using  the  proposed  pipeline  this  error  was  reduced  to  about  1.1  pixels. 

Fig.  53  shows  the  overlays  of  the  same  registered  frames.  The  overlays  corresponding  to  the  registra¬ 
tion  using  the  uncorrected  metadata  (plots  on  the  left  column)  illustrate  a  mixed-up  registration.  However 
the  overlays  of  registration  using  corrected  metadata  by  BA4S  (plots  on  the  right  column)  demonstrate 
consistency  between  the  results.  As  can  be  seen  in  the  plots  on  the  right  column,  only  buildings  were 
wobbled  which  is  due  to  the  parallax  phenomenon  explained  in  Sec  3.2.  The  points  lying  on  the  ground 
plane  from  two  different  views  lined  up  perfectly  which  is  the  result  of  a  precise  registration  by  our  pro¬ 
posed  method.  Unlike  the  image-to-image  registration  method  in  [130],  in  which  the  aerial  images  within 
each  dataset  could  not  be  registered  altogether  (they  were  broken  to  several  segments,  see  Table-I  in  [130]), 
our  proposed  method  is  able  to  efficiently  register  all  the  frames  together  with  no  fragmentations,  thanks 
to  the  proposed  robust  optimization  method. 

4.3  Vehicle  Detection  using  Semantic  Depth  Map  Fusion 

In  this  Section  we  elaborate  and  evaluate  our  proposed  vehicle  moving  object  detection  results  for  ABQ 
aerial  urban  imagery  which  were  collected  by  TransparentSky  [5]  using  an  aircraft  with  on-board  pose 
sensors  flying  over  downtown  Albuquerque,  NM.  Figure  21  shows  samples  of  raw  and  geo-registered 
ultra  high  resolution  images  using  BA4S  state-of-the-art  registration  approach  which  processes  them  in 
very  short  amount  of  time  (in  this  case  less  than  12  minutes  for  1071  images).  We  worked  on  a  cropped 
2k  x  2k  area  of  interest  (AOI)  for  which  the  ground-truth  are  provided  by  Kitware  (Fig.  54). 

4.3.1  Moving  Object  Segmentation 

The  first  input  of  our  fusion  scheme  is  the  trace  of  the  flux  tensor  matrix  which  provides  information  about 
temporal  gradient  changes  or  moving  edges.  Figure  54  shows  the  original  cropped  AOI  and  the  flux  trace 
results,  every  pixel  is  classified  as  moving  versus  stationary  by  thresholding  trace  of  the  corresponding 
flux  tensor  matrix.  However,  approximately  70%  of  the  motion  detection  mask  are  in  fact  false  detections 
which  are  technically  posed  by  parallax  motion  affects  of  tall  structures  (Fig.  54). 

We  fused  the  altitude  information  of  tall  structure  to  the  flux  mask  information  in  order  to  filter  out 
the  false  detection  caused  by  parallax  motion  effect  of  buildings.  Figure  55  presents  the  fusion  results  of 
flux  tensor  motion  detection  and  building  mask.  In  the  first  raw,  flux  tensor  motion  detections  are  shown 
in  2  colors;  false  detections  due  to  parallax  are  shown  in  red  color  and  the  rest  are  in  yellow.  In  order 
to  visually  evaluate  the  results  Ground  truth  bounding  boxes  are  overlayed  on  flux  mask  in  green  color. 
Altitude  mask  corresponding  to  the  ortho-rectified  images  are  shown  in  second  raw.  all  the  pixels  with 
altitude  value  greater  than  20  are  considered  as  tall  structure  and  are  shown  as  blue  mask  in  the  third  raw. 
As  discussed  in  Section  3.8.6  level-set  based  geodesic  active  contours  is  used  to  improve  the  building 
mask  and  reveal  the  filtered  moving  vehicles  next  to  the  buildings.  Improved  building  mask  contours  and 
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(a)  Original  metadata-  Left:  Frame  #0.  Right:  Frame  #60 


(b)  Corrected  by  BA4S-  Left:  Frame  #0.  Right:  Frame  #60 


Figure  52:  Registration  of  frames  #0  and  #60  in  ABQ  WAMI  dataset  for  two  cases  of  uncorrected  metadata  and  corrected  by  the  proposed  BA4S. 
For  illustration,  eight  points  within  these  cropped  registered  images  were  manually  tracked.  The  green  markers  indicate  the  correct  position  of 
the  points  (ground  truth)  and  the  red  ones  indicate  the  corresponding  points  from  the  other  frame  after  overlay.  The  average  RMS  error  for  the 
registered  frames  using  the  original  metadata  (uncorrected)  and  using  corrected  metadata  (by  BA4S)  are  about  70  and  1.1  pixels,  respectively. 
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(a)  Avg  blending-  Left:  uncorrected  metadata.  Right:  corrected  by  BA4S 


(b)  Purple-green  blending-  Left:  uncorrected  metadata.  Right:  corrected  by  BA4S 


(c)  Checkerboard  overlay-  Left:  uncorrected  metadata.  Right:  corrected  by  BA4S 


Figure  53:  Overlays  of  two  registered  frames  (#0  and  #60)  in  ABQ  WAMI  dataset  for  two  cases  of  uncorrected  metadata  and  corrected  by  the 
proposed  BA4S.  The  corresponding  registered  frames  were  already  shown  in  Fig.  52.  As  depicted  on  the  right  plots,  the  points  lying  on  the 
ground  plane  from  the  two  different  frames  are  precisely  coincided  after  BA4S  registration.  Pixels  off  the  ground  plane  like  the  buildings  are 
wobbled  due  to  the  parallax  effects  explained  in  Fig.  7. 
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Figure  54:  Illustration  of  motion  detection  using  Flux  trace  only:  From  left  to  right,  cropped  ROI  of  Albuquerque  aerial  imagery  (fr  100X  the 
spatio-temporal  motion  information  computed  by  flux  trace  for  the  selected  image,  flux  mask  in  which  each  pixel  is  classified  as  moving  or 
stationary  by  thresholding  the  flux  trace.  Morphology  is  applied  to  improve  the  result. 


Figure  55:  Illustration  of  motion  detection  using  Flux  trace  and  Depth 


final  motion  detection  results  are  shown  in  Figure  56. 


4.3.2  Evaluation  Methodology 

Since  the  ultimate  goal  of  the  proposed  motion  detection  system  is  to  enable  persistent  tracking  of  mov¬ 
ing  vehicles,  we  have  evaluated  detection  performance  using  object  level  measures.  Associations  of  the 
detected  moving  blobs  to  ground  truth  objects  is  done  using  a  bidirectional  correspondence  analysis  de¬ 
scribed  in  [43,  89])  that  handles  not  only  one-to-one  matches  but  also  merge  and  fragmentation  cases. 
Precision  and  Recall  are  computed  at  each  stage  of  fusion  as 


7—)  .  .  N TrueDetection 

Precision  = - 

\TP\ 

(73) 

■L' Detection 

~  \TP\  +  \FP\ 

71  N TrueDetection 

Recall  =  - = 

\TP\ 

(74) 

Ngt 

\TP\  +  \FN\ 

where  N TrueDetection  or  TP  is  defined  as  total  number  of  true  one-to-one  individual  matches  plus  the 
number  of  Ground  Truth  fragmented  objects  and  the  number  of  merged  detected  objects.  N Detection  is 
the  cardinality  of  detected  objects  and  Nqt  is  total  number  of  moving  object  bounding  box  presented  in 
Ground  Truth.  We  report  Fmeasure  to  evaluate  the  harmonic  mean  of  recall  and  precision  as  well. 


Precision  x  Recall 

T  —  measure  =  2  x  — - — - — 

Precision  +  Recall 


(75) 


Figure  57  and  58  report  the  object  level  performance  evaluation  results. 
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Figure  56:  Illustration  of  motion  detection  using  Flux  trace+Depth+GAC 


Object-level  Precision/Recall  Evaluation  for  ABQ(200  frames) 


Figure  57:  Object-level  Precision  and  Recall:  Performance  evaluation  of  our  proposed  fused  motion  detection  method 


Object-level  Fmeasure  Evaluation  for  ABQ(200  frames) 


Figure  58:  Object-Level  ^measure'  Performance  evaluation  of  our  proposed  fused  motion  detection  method 

4.4  CS-LoFT  Tracking 

We  have  evaluated  the  CS-LoFT  tracker  on  VOT2015  benchmark  dataset  [11  ],  comparing  against  the 
original  LoFT  tracker.  VOT  challenge  performance  measures  accuracy  and  robustness  were  used  for  eval¬ 
uation.  Accuracy  measures  how  well  the  bounding  box  predicted  by  the  tracker  overlaps  with  the  ground 
truth  bounding  box.  Robustness  measures  how  many  times  the  tracker  loses  the  target  during  tracking.  A 
failure  is  indicated  when  the  overlap  measure  becomes  zero  [116].  Figures  59  and  60  shows  intermediate 
results  related  to  color  and  scale  processing  in  CS-LoFT.  CS-LoFT  switches  to  color  tracking  mode  only 
when  the  tracked  objects  and  their  surroundings  have  different  color  features  (Figure  59a),  relies  on  in¬ 
tensity  otherwise  (Figure  59b).  Figure  60  shows  how  much  the  proposed  frame  level  max-pooling  scale 
selection  method  preserves  spatial  context  while  pixel-wise  max  pooling  over  scales  generates  distractors. 
Table  3  and  Figure  61  show  detailed  evaluation  of  LoFT  and  CS-LoFT  on  60  video  sequences  of  VOT2015 
dataset.  Table  3  groups  the  sequences  into  three  sections  according  to  their  robustness  measurement  on 
CS-LoFT  as  improved,  same,  and  worst.  Figure  62  compares  the  performance  of  CN,  our  extended  CN, 
LoFT,  and  the  proposed  CS-LoFT  trackers  in  terms  of  robustness. 

CS-LoFT  improves  the  performance  of  the  original  LoFT  tracker  in  terms  of  both  accuracy  and  ro¬ 
bustness  by  11%  and  23%  respectively.  Performance  is  particularly  improved  for  the  sequences  that  have 
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(a) 


(b) 


Figure  59:  Color  versus  intensity,  (a)  Tracked  objects  and  their  surroundings  have  different  color  features;  (b)  Tracked 
objects  and  their  surroundings  have  similar  color  but  different  intensity  features.  Columns  left  to  right:  original 
image,  likelihood  map  obtained  using  color  names  (CN)  feature,  likelihood  map  obtained  using  intensity  feature. 
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Figure  60:  Frame  level  max-pooling  best  scale  selection  versus  pixel-based  max-pooling.  The  third  column:  the  frame  level  jC*  over  all  scales, 
while  the  sixth  column:  Cmax  pixel-based  max-pooling  result. 

significant  color  distractor  and  rapid  scale  changes. 

4.5  KC-LoFT  Tracking 

Benchmark  datasets  and  challenges  are  important  for  fair  evaluation  and  comparison  of  trackers.  Since 
2013,  VOT  challenge  group  [1]  has  been  organizing  single  object  tracking  challenges  for  selected  full 
motion  video  datasets.  Table  4  shows  VOT2015[116]  and  VOT2016[115]  ranks  of  LoFT  and  some  other 
state-of-the-art  trackers.  All  the  listed  trackers  perform  better  than  LoFT  on  these  full  motion  videos 
datasets.  However,  the  nature  of  benchmark  datasets  can  be  one  of  the  most  important  factors  in  tracker 
evaluation.  LoFT  and  KC-LoFT  are  trackers  developed  for  aerial  wide  area  motion  imagery.  Aerial 
wide  area  motion  imaginary  (WAMI)  datasets  have  different  characteristics  and  challenges  compared  to 
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(a) 


(b) 

Figure  61:  Performance  evaluation  on  VOT2015  dataset:  (a)  robustness,  (b)  accuracy.  Sequences  reordered  by  robust¬ 
ness  and  accuracy  values  respectively. 


CN  CN_Ex  LoFT  CS-LoFT 


Figure  62:  Robustness  comparison  between  CN,  extended  CN,  LoFT,  and  CS-LoFT  trackers. 

the  regular  full  motion  videos.  These  WAMI  videos  suffers  from  extreme  camera  motion,  low  frame 
rate,  frequent  object  deformations,  rapid  scale  and  appearance  changes,  small  object  sizes,  shadow  and 
illumination  artifacts,  partial  and  full  occlusions. 

For  this  project,  we  have  tested  and  evaluated  the  KC-LoFT  tracker  and  the  selected  state-of-the-art 
trackers  on  the  ABQ  aerial  wide  area  motion  imagery  dataset[  ]  for  downtown  Albuquerque,  NM  Fig¬ 
ure  63.  We  have  manually  generated  ground-truth  tracks  for  136  cars  within  a  200  frames  subset  of  the 
ABQ  dataset.  A  frame  from  the  ABQ  dataset  and  zoomed  views  of  three  sample  cars  from  the  scene  are 
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Table  3:  LoFT  and  CS-LoFT  robustness  and  accuracy  comparison  ordered  form  most  sequences  that  improved  to  less  improved  in  robustness 
measurement. 


No. 

Sequences 

LoFT 

|  CS-LoFT 

LoFT  | 

CS-LoFT 

Robustness 

Accuracy 

1 

iceskaterl 

27 

10 

0.35 

0.40 

2 

book 

14 

4 

0.23 

0.26 

3 

soccer 1 

17 

7 

0.45 

0.49 

4 

iceskater2 

14 

5 

0.35 

0.41 

5 

boltl 

18 

13 

0.23 

0.24 

6 

bolt2 

6 

3 

0.23 

0.25 

7 

hand 

13 

10 

0.33 

0.39 

8 

handball  1 

6 

3 

0.34 

0.39 

9 

handball2 

8 

5 

0.34 

0.39 

10 

pedestrian  1 

6 

3 

0.38 

0.44 

11 

balll 

6 

4 

0.11 

0.14 

12 

dinosaur 

4 

2 

0.26 

0.28 

13 

femando 

4 

2 

0.26 

0.29 

14 

fish4 

5 

3 

0.27 

0.34 

15 

gymnastics  1 

16 

14 

0.32 

0.37 

16 

leaves 

4 

2 

0.35 

0.41 

17 

shaking 

6 

4 

0.40 

0.46 

18 

singer2 

4 

2 

0.42 

0.46 

19 

bag 

1 

0 

0.06 

0.10 

20 

fish2 

6 

5 

0.27 

0.31 

21 

gymnastics2 

7 

6 

0.32 

0.38 

22 

gymnastics4 

6 

5 

0.33 

0.38 

23 

motocrossl 

6 

5 

0.37 

0.42 

24 

motocross2 

4 

3 

0.37 

0.42 

25 

rabbit 

8 

7 

0.39 

0.45 

26 

soldier 

2 

1 

0.47 

0.50 

27 

sphere 

2 

1 

0.49 

0.53 

28 

tunnel 

2 

1 

0.63 

0.61 

29 

ball2 

2 

2 

0.16 

0.15 

30 

birds2 

1 

1 

0.18 

0.22 

31 

blanket 

2 

2 

0.21 

0.22 

32 

crossing 

0 

0 

0.26 

0.28 

33 

fishl 

4 

4 

0.26 

0.30 

34 

godfather 

1 

1 

0.31 

0.37 

35 

graduate 

11 

11 

0.32 

0.37 

36 

gymnastics3 

5 

5 

0.32 

0.38 

37 

helicopter 

1 

1 

0.34 

0.40 

38 

nature 

7 

7 

0.38 

0.42 

39 

racing 

0 

0 

0.40 

0.45 

40 

road 

7 

7 

0.40 

0.45 

41 

sheep 

2 

2 

0.40 

0.46 

42 

singer3 

1 

1 

0.44 

0.47 

43 

soccer2 

5 

5 

0.46 

0.50 

44 

tiger 

8 

8 

0.50 

0.54 

45 

traffic 

1 

1 

0.51 

0.56 

46 

wiper 

1 

1 

0.65 

0.61 

47 

birds  1 

9 

10 

0.17 

0.17 

48 

bmx 

0 

1 

0.22 

0.22 

49 

carl 

0 

1 

0.24 

0.27 

50 

car2 

0 

1 

0.24 

0.27 

51 

girl 

8 

9 

0.28 

0.36 

52 

glove 

2 

3 

0.31 

0.36 

53 

marching 

2 

3 

0.36 

0.42 

54 

octopus 

0 

1 

0.38 

0.44 

55 

pedestrian2 

1 

2 

0.39 

0.44 

56 

singer  1 

0 

1 

0.42 

0.46 

57 

butterfly 

3 

5 

0.23 

0.26 

58 

fish3 

1 

3 

0.27 

0.33 

59 

matrix 

9 

11 

0.36 

0.42 

60 

basketball 

6 

9 

0.17 

0.15 

Avg. 

5.37 

4.15 

0.33 

0.37 

shown  in  figure  64.  For  tracker  performance  evaluation,  we  have  used  two  VOT  challenge  measures  Ac¬ 
curacy  and  Robustness.  Accuracy  measures  how  well  the  bounding  box  predicted  by  a  tracker  overlaps 
with  the  ground  truth  bounding  box.  Robustness  measures  number  of  times  a  tracker  looses  the  target  dur¬ 
ing  tracking.  Table  5  summarizes  the  tracking  performances  for  the  proposed  KC-LoFT  tracker  and  other 
state-of-the-art  trackers  on  ABQ  dataset.  KC-LoFT  increases  both  accuracy  and  robustness  of  the  original 
LoFT  tracker  (by  9.6%  and  5.1%  respectively)  and  produces  better  or  comparable  results  compared  to  the 
state-of-the-art  trackers  from  the  VOT2015[1 16]  and  VOT2016[1 1  ]  challenges.  Figure  65  shows  sample 
tracking  results  for  two  cars.  Trajectory  color  represents  number  of  reinitializations  after  tracker  failures. 
Lower  number  of  trajectory  colors  indicates  tracker  robustness.  In  both  cases  KC-LoFT  tracks  the  selected 
cars  without  any  failures  or  restarts,  while  for  the  first  car  LoFT,  DSST,  and  Staple  require  one  or  more 
restarts,  and  for  the  second  car  all  the  trackers  except  KC-LoFT  require  one  or  more  restarts. 


4.6  Multi-object  Tracking 

We  have  tested  and  evaluated  our  MOT  tracker  on  UA-DETRAC-test  multi-object  tracking  benchmark 
dataset  [211]  consisting  of  40  challenging  real-world  traffic  videos.  We  have  evaluated  the  performance  of 
our  tracker  using  the  UA-DETRAC  evaluation  toolkit.  The  evaluation  process  includes  the  following  met- 
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Figure  63:  Sample  KC-LoFT  tracking  results  on  ABQ  aerial  surveillance  dataset. 


Table  4:  VOT2015  and  VOT2016  ranks  of  the  best  non-deep  learning  trackers  whose  codes  were  made  public.  Lower  rank  is  better.  means 
the  tracker  did  not  participate  to  the  challenge. 


Tracker  Name 

Rank  on  VOT2015 

Rank  on  VOT2016 

LoFT 

#60 

#68 

STAPLE[34] 

- 

#5 

SAMF  [126] 

#8 

#24 

DSST  [6i  ] 

#38 

#43 

SRDCF  [6  ] 

#4 

#17 

Table  5:  Tracker  performance  comparison  on  ABQ  wide  area  motion  imagery  dataset.  Bold:  best  result,  underlined:  second  best  result. 


Tracker  name 

Accuracy^ 

Robustness! 

SAMF[126] 

0.6396 

0.9197 

DSST  [60] 

0.6167 

1.7810 

SRDCF[61] 

0.4875 

0.7007 

STAPLE[34] 

0.4972 

1.1824 

LoFT 

0.5413 

0.8540 

KC-LoFT 

0.5938 

0.8102 

rics  described  in  [  ,11]:  mostly  track  (PR-MT),  mostly  lost  (PR-ML),  identity  switches  (PR-IDS),  track 
fragmentation  (PR-FRAG),  false  positives  (PR-FP),  false  negatives  (PR-FN),  multi-object  tracking  pre¬ 
cision  (PR-MOTP),  and  multi-object  tracking  accuracy  (PR-MOTA).  Accuracy  of  the  obtained  tracks  are 
best  reflected  by  combination  of  the  metrics  PR-IDS  and  PR-FRAG.  Final  evaluation  measures  are  com¬ 
puted  by  applying  the  trackers  on  a  set  of  object  detection  results  obtained  by  varying  detection  threshold, 
corresponding  to  different  detection  precision  and  recall  levels. 

We  have  combined  our  tracker  on  four  different  state-of-the-art  object  detectors,  Comp  ACT  [46], 
R-CNN  [85],  ACF  [67],  and  DPM  [73];  and  compared  our  tracking  results  to  six  state-of-the-art  MOT 
trackers,  including  GOG  [173],  CEM  [25],  DCT  [26],  IHTLS  [65],  H2T  [212],  and  CMOT  [29].  Ta- 
ble  6  summarizes  the  tracking  performances  of  our  tracker  and  other  state-of-the-art  trackers  (described  in 
[21 1]).  Our  tracker  outperforms  the  other  trackers  in  term  of  PR-MOTA,  PR-MOTP,  PR-MT,  PR-ML, 
and  PR-FN;  produces  comparable  results  for  PR-FRAG,  PR-IDS,  and  results  in  second  highest  score  in 
terms  of  PR-MOTA,  PR-ML,  and  PR-FP. 

We  have  analyzed  the  contribution  of  different  components  to  the  overall  performance  of  our  tracker  by 
systematically  enabling/disabling  different  components  (Figure  66).  Our  tests  show  that  all  components 


82 

APPROVED  FOR  PUBLIC  RELEASE;  DISTRIBUTION  UNLIMITED 


Figure  64:  Sample  car  templates  from  an  ABQ  dataset  frame. 


Table  6:  Tracker  performance  comparison  using  four  state-of-the-art  object  detectors  as  input.  Last  raw  indicates  whether  the  metric  for  the 
specific  column  is  better  when  high  or  low,  +  and  —  respectively.  Best  performance  for  each  metric  are  marked  in  bold.  We  have  two  entries 
for  our  tracker  based  on  how  the  performances  scores  are  averaged.  Challenge  reports  results  in  two  groups  corresponding  to  Easy  versus 
(Medium  +  Hard)  videos.  Proposed-A  computes  average  score  as  ( Score  Easy  +  ScoreMed+Hard)  /  2.  Proposed-B  computes  average  score 
as  (0.25  x  ScoreEasy  +  0-75  x  S core Med+ Hard)  based  on  number  of  video  sequences  in  each  group.  Scores  are  computed  and  averaged 
for  all  four  object  detectors  according  to  [211]. 


Trackers 

PR-MOTA 

PR-MOTAP 

PR-MT 

PR-ML 

PR-IDS 

PR-FRAG 

PR-FP 

PR-FN 

GOG  [173] 

10.1 

35.3 

10.9 

22.5 

4248.5 

4137.3 

43657.6 

199926.8 

CMOT  [  )] 

7.0 

34.6 

12.8 

21.2 

414.3 

1817.5 

79577.3 

187508.8 

H'T  [212] 

7.8 

34.6 

11.2 

22.2 

1298.9 

1477.4 

65275.3 

196107.0 

DCT  [  S] 

8.3 

35.6 

5.5 

32.3 

270.1 

261.4 

17658.2 

241590.8 

IHTLS  [65] 

5.8 

35.1 

9.6 

23.4 

1329.2 

4597.1 

68635.0 

205649.3 

CEM  [25] 

3.9 

33.6 

2.4 

36.1 

394.3 

529.0 

19044.8 

267699 

SCTrack(Ours)-A 

12.8 

37.7 

12.9 

21.9 

494.9 

1408.0 

27581.0 

99642.3 

SCTrack(Ours)-B 

9.8 

35.3 

10.96875 

22.637 

656.48 

1408.08 

35945.59 

130496.6 

Better 

+ 

+ 

+ 

- 

- 

- 

- 

- 

contribute  to  better  accumulated  performance.  For  example,  in  Experiment  —  E ,  disabling  the  CN- 
to-CN  color  correlation  weight  matrix  Wcn  and  assuming  uncorrelated  color  codes,  results  in  a  drop  in 
PR-MOTA  metric  from  13. 1  to  12.7.  Figure  67  shows  performance  improvement  in  the  tracker  by  addition 
of  each  component. 

Table  7  summarizes  frame  rates  (in  terms  of  frames  per  seconds)  for  our  tracker  and  other  state-of- 
the-art  trackers  provided  in  [211].  Frame  rate  for  our  tracker  has  been  obtained  averaging  frame  rates  on 
40  sequences  of  UA-DETRAC-test  set.  During  the  tests,  Comp  ACT  object  detection  results  have  been 
used  as  input.  Our  simple  feature  set  combined  with  two-step  distance-only  local  and  distance  appearance 
combined  global  data  association  scheme  allows  high  frame  rates  while  still  preserving  reliability  and 
accuracy  of  our  tracker. 
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First  start 


Second  Failure 


Third  Failure 


Fourth  Failure 


Fifth  Failure 


LoFT 


KC-toFT 


SAMF 


SRDCF 


D5ST 


Staple 


Car  #3  Fr# 


SAMF  SRDCF 


Figure  65:  Comparison  of  tracking  results  in  terms  of  robustness.  Trajectory  color  represents  number  of  reinitializations  after  tracker  failures. 
Lower  number  of  trajectory  colors  indicates  higher  tracker  robustness. 

5  Conclusions 

In  this  effort  we  have  worked  on  accomplishing  the  goals  of  the  project  which  was  to  push  the  computation 
closer  to  the  sensor.  In  this  direction  we  developed  novel  and  effective  methods  for  feature  detection  and 
matching  techniques  along  the  video  sequence,  structure-from-motion  and  bundle  adjustment  for  high 
resolution  aerial  imagery,  3D  reconstruction  algorithm  to  run  on  metropolitan- scale  data,  analytical  image 
stabilization  and  georegistration,  and  moving  object  detection  and  tracking  algorithms. 


Table  7:  Frame  rate  comparison. 


Trackers 

Code 

Frame  Rate  (fps) 

CEM 

Matlab 

4.62 

GOG 

Matlab 

389.51 

DCT 

Matlab,  C++ 

2.19 

IHTL 

Matlab 

19.79 

H2T 

C++ 

3.02 

CMOT 

Matlab 

3.79 

SCTrack(Ours) 

Matlab 

362.00 
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Figure  66:  Contribution  of  each  tracker  component.  Left:  PR-MOTA  metric  for  each  experiment  compared  to  the  Full  Mode  (higher  is  better). 
Right:  description  of  each  experiment. 


Figure  67:  Contribution  of  each  tracker  component  on  PR-IDS  and  PR-FRAG  metrics  (lower  is  better). 


85 

APPROVED  FOR  PUBLIC  RELEASE;  DISTRIBUTION  UNLIMITED 


6  LIST  OF  SYMBOLS  ABBREVIATIONS  AND  ACRONYMS 


•  Wide  Area  Motion  Imagery  (WAMI) 

•  Bundle  Adjustment  (BA) 

•  Bundle  Adjustment  for  Sequential  Imagery  (BA4S) 

•  Structure  from  Motion  (SfM) 

•  Likelihood  of  Features  Tracking  (LOFT) 

•  Kernelized  Correlation  Extended  LoFT  for  Single  Object  Tracking  (KC-LoFT) 

•  ICCV  Video  Object  Tracking  Challenge  (VOTC) 

•  Ground  sampling  distance  (GSD) 

•  Levenberg-Marquardt  (LM) 

•  Unmanned  Aerial  Vehicle  (UAV) 

•  Epipolar  Transformation  Visualization  Tool  (EpiX) 

•  Simultaneous  Localization  And  Mapping  (SLAM) 

•  Kalman  Filter  (KF) 

•  Extended  Kalman  Filter  (EKF) 

•  Unscented  Kalman  Filter  (UKF) 

•  Sequential  Monte-Carlo  (SMC) 

•  Maximum  Likelihood  (ML) 

•  Expectation  Maximization  (EM) 

•  Bilateral  Filter  (BLF) 
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